# NLP and Reddit Subcommunities - Data Acquisition
---

## Problem Statement

The snowboarding and marathon running communities represent vibrant and rapidly growing segments within the sports industry. The 2021 winter season saw the highest number of active skiers and snowboarders in over 25 years, a 26% increase in less than a decade, and for the first time growth in interest in snowboarding outpaced that of skiing ([1](https://kenver.com/blogs/news/state-snow-sports)). Between 2008 and 2018, women's and men's marathon participation increased by 56.83% and 46.91%, respectively ([2](https://www.livestrong.com/article/13763749-marathon-statistics/)).

These communities have both distinct and shared preferences and needs, which gives sports apparel companies a meaningful opportunity to cater to them. By understanding these characteristics, companies can meet the communities' needs more effectively, and community members themselves can improve their performance by choosing the brands, apparel, and accessories that suit them.

For brands that would benefit from gaining market share in these communities, such as Under Armour, Inc., this knowledge can inform tailored product offerings, refined marketing strategies, and a stronger brand presence within these niche markets.

This analysis spans three notebooks. This notebook gathers the data with PRAW (the Python Reddit API Wrapper) and does some initial cleaning; the [second notebook](../code/02_Feature_Engineering.ipynb) explores and cleans the data; and the [third notebook](../code/03_Model_Building.ipynb) transforms the variables into model-ready form and compares several classification models to find the one that best predicts which community a Reddit post came from. At the end of this analysis, Under Armour will better understand how to leverage opportunities based on what the customers within these communities want. Ultimately, this should help them establish themselves as the preferred choice for sports apparel among snowboarders and marathon runners, in turn driving business growth and fostering lasting customer relationships.

For more information on the background, a summary of methods, and findings, please see the associated [README](../Farah_Malik_Proj2_README.md) for this analysis.


In [1]:
#pip install praw

In [2]:
import os
os.getcwd()

'C:\\Users\\farah\\Documents\\General Assembly DSI\\DSI-508\\Projects\\project-3\\code'

In [3]:
os.chdir('C:/Users/farah/Documents/General Assembly DSI/DSI-508/Projects/project-3/code')
os.getcwd()

'C:\\Users\\farah\\Documents\\General Assembly DSI\\DSI-508\\Projects\\project-3\\code'

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import time
import re # source: https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105
import string

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

import praw
from creds import secrets

In [5]:
# Execute PRAW
reddit = praw.Reddit(
    client_id=secrets.get('client_id'),
    client_secret=secrets.get('client_secret'),
    user_agent=secrets.get('user_agent'),
    username=secrets.get('username'),
    password=secrets.get('password')
)
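The `creds` module imported above is not included in this notebook. A minimal sketch of what such a file could look like, assuming the credentials are kept in environment variables (the variable names are hypothetical; only the five dictionary keys come from the `praw.Reddit` call above):

```python
import os

# Hypothetical creds.py: the environment variable names are assumptions.
# Only the five dictionary keys are taken from the praw.Reddit call above.
secrets = {
    "client_id": os.environ.get("REDDIT_CLIENT_ID"),
    "client_secret": os.environ.get("REDDIT_CLIENT_SECRET"),
    "user_agent": os.environ.get("REDDIT_USER_AGENT"),
    "username": os.environ.get("REDDIT_USERNAME"),
    "password": os.environ.get("REDDIT_PASSWORD"),
}
```

Keeping a file like this out of version control avoids publishing the keys alongside the notebook.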

### Choosing Posts
##### I had two options for pulling Reddit posts: 
1. <u>Method 1</u>: Pulling the 700 newest posts and the 500 "top" posts that were posted earlier than those newest posts - focus is on the new and top posts, and by filtering on the dates there should be little or no overlap.
2. <u>Method 2</u>: Pulling up to 1000 posts from each of the New, Hot, Top, and Controversial listings and de-duplicating - focus is on variety and getting more posts.

After trying both methods, **Method 2** was chosen. Therefore, the Method 1 cells below have been converted to Raw cells so they will not be executed.
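The core of Method 2 is stacking the rows from several listings and dropping repeats. A small sketch of that step on made-up rows (the titles and bodies here are toy data, not pulled posts):

```python
import pandas as pd

# Toy rows standing in for posts returned by two different listings.
cols = ["title", "selftext"]
new_rows = [["Too much overhang?", ""],
            ["First board advice", "Looking for a beginner deck"]]
top_rows = [["First board advice", "Looking for a beginner deck"],
            ["Wait for it", ""]]

# Stack the listings, then keep the first copy of each (title, selftext) pair.
stacked = pd.DataFrame(new_rows + top_rows, columns=cols)
deduped = stacked.drop_duplicates(subset=cols)
```

`drop_duplicates` keeps the original index labels of the surviving rows, which is why `snow.info()` later reports 2753 entries with labels running up to 3990.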

#### Method 1 - Pulling Posts (Tested --> Not Chosen --> Commented Out Using Raw Cells)

#### Method 2 - Pulling Posts

In [6]:
# METHOD 2: To maximize number of posts among new, hot, top, and controversial, will pull from each and then dedupe --> Opting to leverage code given by Tim, source: Tim Book
def pull_data(posts, label):
    # Collect one row of attributes per post in the listing
    data = []
    for post in posts:
        data.append([post.created_utc, post.author, post.title, post.selftext, post.score, post.upvote_ratio, post.num_comments, post.subreddit])
    # Report how many posts this listing returned
    print(f'{label.upper()} POSTS: N = {len(data)}')
    return data
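To see the shape of the rows this produces without calling the Reddit API, the same attribute collection can be run on stand-in objects. This is only a sketch: `pull_rows`, `FIELDS`, and the fake post are hypothetical, and `SimpleNamespace` is used in place of a real PRAW submission.

```python
from types import SimpleNamespace

# The eight attributes collected per post, in column order.
FIELDS = ["created_utc", "author", "title", "selftext",
          "score", "upvote_ratio", "num_comments", "subreddit"]

def pull_rows(posts):
    """Return one list of attribute values per post (hypothetical helper)."""
    return [[getattr(post, field) for field in FIELDS] for post in posts]

# A stand-in post; author=None mimics a post whose author is unavailable.
fake_posts = [SimpleNamespace(created_utc=1686703000.0, author=None,
                              title="Example title", selftext="",
                              score=0, upvote_ratio=0.5, num_comments=1,
                              subreddit="snowboardingnoobs")]
rows = pull_rows(fake_posts)
```

Rows with a missing author are one plausible reason the `author` column later shows fewer non-null values than the other columns.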

In [7]:
def comm(community):
    subreddit = reddit.subreddit(community)

    posts_new = subreddit.new(limit = 1000)
    posts_hot = subreddit.hot(limit = 1000)
    posts_top = subreddit.top(limit = 1000)
    posts_con = subreddit.controversial(limit = 1000)
    
    return posts_new, posts_hot, posts_top, posts_con

In [8]:
# SNOWBOARDING REDDIT
posts_new, posts_hot, posts_top, posts_con = comm('snowboardingnoobs')
data_new = pull_data(posts_new, 'new')
data_top = pull_data(posts_top, 'top')
data_hot = pull_data(posts_hot, 'hot')
data_con = pull_data(posts_con, 'controversial')
snow = pd.DataFrame(data_new + data_hot + data_top + data_con, columns = ['created_utc', 'author', 'title', 'selftext', 'score', 'upvote_ratio', 'num_comments', 'subreddit'])
snow.drop_duplicates(subset=['title', 'selftext'], inplace=True)
snow.shape

NEW POSTS: N = 993
TOP POSTS: N = 1000
HOT POSTS: N = 1000
CONTROVERSIAL POSTS: N = 999


(2753, 8)

In [9]:
snow.to_csv('../data/snowboarding2.csv', index=False)

In [10]:
# SKIING REDDIT
posts_new, posts_hot, posts_top, posts_con = comm('skiing')
data_new = pull_data(posts_new, 'new')
data_top = pull_data(posts_top, 'top')
data_hot = pull_data(posts_hot, 'hot')
data_con = pull_data(posts_con, 'controversial')
ski = pd.DataFrame(data_new + data_hot + data_top + data_con, columns = ['created_utc', 'author', 'title','selftext', 'score', 'upvote_ratio', 'num_comments', 'subreddit'])
ski.drop_duplicates(subset=['title', 'selftext'], inplace=True)
ski.shape

NEW POSTS: N = 973
TOP POSTS: N = 1000
HOT POSTS: N = 680
CONTROVERSIAL POSTS: N = 992


(2887, 8)

In [11]:
ski.to_csv('../data/skiing2.csv', index=False)

### Exploring/Cleaning Snowboarding DataFrame

In [12]:
# Concatenate Title and Selftext
snow['text'] = snow['title'].str.lower().str.strip() + " " + snow['selftext'].str.lower().str.strip()

In [13]:
# Look for Missing Texts
snow.text.isnull().sum()

0
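The null check matters because string concatenation in pandas propagates missing values: if either piece is missing, the combined text is missing too. A toy illustration, where the final `str.strip()` and the `fillna("")` calls are additions not used in the notebook:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Lamar Whisper?", "Wait for it"],
                    "selftext": ["Hi all!", None]})

# Same concatenation as above: a missing selftext makes the whole text missing.
raw_text = (toy["title"].str.lower().str.strip() + " "
            + toy["selftext"].str.lower().str.strip())

# Filling missing pieces with empty strings first keeps the title.
safe_text = (toy["title"].fillna("").str.lower().str.strip() + " "
             + toy["selftext"].fillna("").str.lower().str.strip()).str.strip()
```

In this pull every post had a string `selftext` (possibly empty), so the count of missing texts came back as 0.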

In [14]:
# Look for Other Missing Variables
snow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2753 entries, 0 to 3990
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   created_utc   2753 non-null   float64
 1   author        2645 non-null   object 
 2   title         2753 non-null   object 
 3   selftext      2753 non-null   object 
 4   score         2753 non-null   int64  
 5   upvote_ratio  2753 non-null   float64
 6   num_comments  2753 non-null   int64  
 7   subreddit     2753 non-null   object 
 8   text          2753 non-null   object 
dtypes: float64(2), int64(2), object(5)
memory usage: 215.1+ KB


In [18]:
# Look for Duplicates
snow['text'].value_counts()

too much overhang?

In [19]:
# Seeing a few duplicates due to difference in upper vs lower case - will dedupe one more time, post-all-lower-casing, for good measure
snow[snow['text'] == "too much overhang?"]

Unnamed: 0,created_utc,author,title,selftext,score,upvote_ratio,num_comments,subreddit,text


In [20]:
snow.drop_duplicates(subset=['text'], inplace=True)
snow['text'].value_counts()

my buddy’s new display

In [21]:
snow.shape

(2752, 9)

In [22]:
# Confirm no nulls - blank may not be captured as NaN or None
snow[snow['text'] == ""]

Unnamed: 0,created_utc,author,title,selftext,score,upvote_ratio,num_comments,subreddit,text


In [23]:
# Confirm no nulls - blank may not be captured as NaN or None
snow[snow['text'] == " "]

Unnamed: 0,created_utc,author,title,selftext,score,upvote_ratio,num_comments,subreddit,text
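The two exact-match checks above only catch `""` and a single space. A sketch of one check that covers any whitespace-only value, using toy strings:

```python
import pandas as pd

toy = pd.Series(["wait for it ", "", " ", "  \n", None])

# Treat missing values as empty, strip surrounding whitespace,
# then flag anything that has no characters left.
is_blank = toy.fillna("").str.strip().eq("")
```

Stripping before comparing also sidesteps trailing spaces. A title-only post ends up as the title plus one space after concatenation, which may be why the earlier exact match on "too much overhang?" returned no rows; that is a guess rather than something the outputs confirm.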


In [24]:
# Look at text for anomalous characters
pd.set_option('display.max_rows', None)
snow.text.value_counts()[:15]

my buddy’s new display

In [25]:
# Get Rid of Line Breaks
snow['text'] = snow['text'].str.replace('\n', '')
snow.text.value_counts()[:15]

my buddy’s new display
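Replacing `'\n'` with an empty string joins the words on either side of a line break when there is no space next to it. An alternative sketch that collapses every run of whitespace into a single space instead (`normalize_whitespace` is a hypothetical helper, not part of the notebook):

```python
import re

def normalize_whitespace(text):
    """Collapse line breaks, tabs, and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

sample = "first line\nsecond line"
joined = sample.replace("\n", "")        # the approach used above
normalized = normalize_whitespace(sample)
```

It could be applied with `snow['text'].apply(normalize_whitespace)`; with a word-level tokenizer later on, the difference only shows up for posts where a line break directly separates two words.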

In [26]:
# Regex for removing URLs left over in the text --> Do Not End Up Using
# source: https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105
        # https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string
rmv_html = r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
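Although the pattern above was not used in the end, the way it would be applied is a plain `re.sub`. The sketch below uses a much simpler stand-in pattern (an assumption, far less thorough than `rmv_html`) to show the idea:

```python
import re

# Simplified stand-in for rmv_html: matches http(s) links and bare www. links.
SIMPLE_URL = r"https?://\S+|www\.\S+"

def strip_urls(text):
    """Remove URLs, then tidy up the whitespace they leave behind."""
    without_urls = re.sub(SIMPLE_URL, "", text)
    return re.sub(r"\s+", " ", without_urls).strip()

cleaned = strip_urls("see https://example.com/board for pics")
```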

In [27]:
# Function for Lemmatizing
def lemmatize_txt(text):
       
    # Remove Punctuation --> Unnecessary, RegexpTokenizer Handles This
    # adapted from: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
    #no_punc = [s for s in split_txt if s not in string.punctuation]
    #txt_f = ' '.join(no_punc)
        
    # Remove HTML Pieces --> source: https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105; https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string
    # Did Not Need
    #text = re.sub(rmv_html, '', text)
        
    # Tokenize Into Words
    #split_txt = text.split()
    tokenizer = RegexpTokenizer(r'\w+')
    split_txt = tokenizer.tokenize(text)

    # Instantiate lemmatizer
    lemmatizer = WordNetLemmatizer()
        
    # Lemmatize and Rejoin
    return ' '.join([lemmatizer.lemmatize(word) for word in split_txt])

In [28]:
%%time

# Apply Lemmatization - Create New Column w/ Lemmatized Results for EDA
snow['lem_text'] = snow['text'].apply(lemmatize_txt)
snow.lem_text.value_counts()[:10]

CPU times: total: 625 ms
Wall time: 1.77 s


too much toe overhang

In [29]:
# Function for Stemming
def stem_txt(text):
    
    # Remove HTML Pieces --> source: https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105; https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string
    # Did Not Need
    #text = re.sub(rmv_html, '', text)

    # Tokenize Into Words
    #split_txt = text.split(' ')
    tokenizer = RegexpTokenizer(r'\w+')
    split_txt = tokenizer.tokenize(text)

    # Instantiate Stemmer
    p_stemmer = PorterStemmer()

    # Stem and Rejoin
    return ' '.join([p_stemmer.stem(word) for word in split_txt])

In [30]:
%%time

# Apply Stemming - Create New Column w/ Stemmed Results for EDA
snow['stem_text'] = snow['text'].apply(stem_txt)
snow.stem_text.value_counts()[:10]

CPU times: total: 719 ms
Wall time: 1.88 s


too much toe overhang

In [31]:
snow.head()

Unnamed: 0,created_utc,author,title,selftext,score,upvote_ratio,num_comments,subreddit,text,lem_text,stem_text
0,1686703000.0,dea_ton,My buddy’s new display,,0,0.5,1,snowboardingnoobs,my buddy’s new display,my buddy s new display,my buddi s new display
1,1686690000.0,Kerlebsky,What do I do about a chip like this?,,3,0.8,13,snowboardingnoobs,what do i do about a chip like this?,what do i do about a chip like this,what do i do about a chip like thi
2,1686681000.0,adknkd,Rome Katana v Bataleon Astro Asym?,A post on SnowboardingForum said these were co...,3,1.0,2,snowboardingnoobs,rome katana v bataleon astro asym? a post on s...,rome katana v bataleon astro asym a post on sn...,rome katana v bataleon astro asym a post on sn...
3,1686675000.0,Alarmed_Cranberry313,How to stop boots from smelling (No dry rack),Im about to go to my seasonal job at a snow re...,5,1.0,15,snowboardingnoobs,how to stop boots from smelling (no dry rack) ...,how to stop boot from smelling no dry rack im ...,how to stop boot from smell no dri rack im abo...
4,1686640000.0,th04r_,Lamar Whisper?,Hi all! I recently picked up a used Lamar Whis...,3,0.81,5,snowboardingnoobs,lamar whisper? hi all! i recently picked up a ...,lamar whisper hi all i recently picked up a us...,lamar whisper hi all i recent pick up a use la...
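The `stem_text` column above can be reproduced on single strings to see what the tokenizer and stemmer each contribute. A compact sketch of the same two steps (`stem_tokens` is a hypothetical name; the inputs are taken from the `text` column in the table):

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")   # keeps runs of word characters only
stemmer = PorterStemmer()

def stem_tokens(text):
    """Tokenize on word characters, stem each token, and rejoin with spaces."""
    return " ".join(stemmer.stem(tok) for tok in tokenizer.tokenize(text))

# Punctuation is dropped by the tokenizer; suffixes are cut by the stemmer.
boots = stem_tokens("how to stop boots from smelling (no dry rack)")
buddy = stem_tokens("my buddy’s new display")
```

The tokenizer is what splits "buddy’s" into "buddy" and "s", since the apostrophe is not a word character; the stemmer then turns "buddy" into "buddi" and "dry" into "dri", matching the rows shown above.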


### Exploring/Cleaning Skiing DataFrame

In [32]:
# Concatenate Title and Selftext
ski['text'] = ski['title'].str.lower().str.strip() + " " + ski['selftext'].str.lower().str.strip()

In [33]:
# Check for Missing Texts
ski.text.isnull().sum()

0

In [34]:
# Check for Other Missing Variables
ski.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2887 entries, 0 to 3643
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   created_utc   2887 non-null   float64
 1   author        2784 non-null   object 
 2   title         2887 non-null   object 
 3   selftext      2887 non-null   object 
 4   score         2887 non-null   int64  
 5   upvote_ratio  2887 non-null   float64
 6   num_comments  2887 non-null   int64  
 7   subreddit     2887 non-null   object 
 8   text          2887 non-null   object 
dtypes: float64(2), int64(2), object(5)
memory usage: 225.5+ KB


In [35]:
ski['text'].value_counts()[:15]

learned to 360 at 53 this was the first day i felt i could throw a 3 consistently after several months of tiny progressions and getting a few 3’s along the way.    this was the first batch of 3s where i had air awareness and was actually seeing the horizon and the landing.    \n\ni kinda was forced to do them over and over again this day as each time i recruited a random stranger to get my first video they botched it 😂 and i had to go do it again. thanks brian from co for getting this.. the only one i have ever had recorded.  also thanks mammoth lifty who out of the blue told me he had been watching me over a couple days and i was going to get “it.”  dude you seemed genuinely invested and interested and it was appreciated. it’s not easy trying to learn this stuff in your 50s and it’s a bit lonely at times. \n\ni see a lot of older skiers (i sometimes have to laugh when they are 32 acting like they have accomplished all they can😂) commenting under 360 posts on here about how they “day d

In [36]:
# Confirm no nulls - blank may not be captured as NaN or None
ski[ski['text'] == ""]

Unnamed: 0,created_utc,author,title,selftext,score,upvote_ratio,num_comments,subreddit,text


In [37]:
# Confirm no nulls - blank may not be captured as NaN or None
ski[ski['text'] == " "]

Unnamed: 0,created_utc,author,title,selftext,score,upvote_ratio,num_comments,subreddit,text


In [38]:
# Get Rid of Line Breaks
ski['text'] = ski['text'].str.replace('\n', '')
ski.text.value_counts()[:15]

learned to 360 at 53 this was the first day i felt i could throw a 3 consistently after several months of tiny progressions and getting a few 3’s along the way.    this was the first batch of 3s where i had air awareness and was actually seeing the horizon and the landing.    i kinda was forced to do them over and over again this day as each time i recruited a random stranger to get my first video they botched it 😂 and i had to go do it again. thanks brian from co for getting this.. the only one i have ever had recorded.  also thanks mammoth lifty who out of the blue told me he had been watching me over a couple days and i was going to get “it.”  dude you seemed genuinely invested and interested and it was appreciated. it’s not easy trying to learn this stuff in your 50s and it’s a bit lonely at times. i see a lot of older skiers (i sometimes have to laugh when they are 32 acting like they have accomplished all they can😂) commenting under 360 posts on here about how they “day dream” of

In [39]:
%%time

# Apply Lemmatization - Create New Column w/ Lemmatized Results for EDA
ski['lem_text'] = ski['text'].apply(lemmatize_txt)
ski.lem_text.value_counts()[:10]

CPU times: total: 15.6 ms
Wall time: 268 ms


wait for it

In [40]:
%%time

# Apply Stemming - Create New Column w/ Stemmed Results for EDA
ski['stem_text'] = ski['text'].apply(stem_txt)
ski.stem_text.value_counts()[:10]

CPU times: total: 15.6 ms
Wall time: 1 s


wait for it

In [41]:
# Duplicates after Lemmatizing and Stemming, Take a Look at an Example

ski[ski.stem_text == 'wait for it']
# Duplicates after stemming/lemmatizing are okay - they are coming from different posts/authors

Unnamed: 0,created_utc,author,title,selftext,score,upvote_ratio,num_comments,subreddit,text,lem_text,stem_text
2184,1598200000.0,snusmumien,Wait for it..,,1789,1.0,57,skiing,wait for it..,wait for it,wait for it
2395,1637766000.0,shredmeister404,Wait for it,,1558,0.97,56,skiing,wait for it,wait for it,wait for it
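The same check can be done for every repeated `stem_text` at once rather than one example at a time. A sketch using the two rows shown above as toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "author": ["snusmumien", "shredmeister404"],
    "text": ["wait for it..", "wait for it"],
    "stem_text": ["wait for it", "wait for it"],
})

# Keep every row whose stemmed text appears more than once.
dupes = toy[toy.duplicated(subset="stem_text", keep=False)]

# Count distinct authors behind each repeated stemmed text.
authors_per_text = dupes.groupby("stem_text")["author"].nunique()
```

A repeated `stem_text` with more than one distinct author points to separate posts that happen to share wording, rather than the same post pulled twice.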


In [42]:
ski.head()

Unnamed: 0,created_utc,author,title,selftext,score,upvote_ratio,num_comments,subreddit,text,lem_text,stem_text
0,1686693000.0,OLIIIIIEVR,Curated referral code c-Oliver-591,,0,0.13,0,skiing,curated referral code c-oliver-591,curated referral code c oliver 591,curat referr code c oliv 591
1,1686677000.0,Asby2151,Are monoskis anymore dangerous than regular sk...,,5,0.73,9,skiing,are monoskis anymore dangerous than regular sk...,are monoskis anymore dangerous than regular sk...,are monoski anymor danger than regular ski for...
2,1686672000.0,IBTaylor,A big steep line I sent this winter.,,2311,0.94,175,skiing,a big steep line i sent this winter.,a big steep line i sent this winter,a big steep line i sent thi winter
3,1686665000.0,avalanchepacifist,Vail Sun Down Express: A Game Changer for the ...,,19,0.77,3,skiing,vail sun down express: a game changer for the ...,vail sun down express a game changer for the b...,vail sun down express a game changer for the b...
4,1686623000.0,bradbrookequincy,Learned to 360 at 53,This was the first day I felt I could throw a ...,440,0.98,38,skiing,learned to 360 at 53 this was the first day i ...,learned to 360 at 53 this wa the first day i f...,learn to 360 at 53 thi wa the first day i felt...


### Save Dataset for EDA

In [43]:
combo = pd.concat([snow, ski])

In [46]:
combo.to_csv('../data/Clean/snow_ski.csv', index=False)
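Because `snow` and `ski` each keep their own index labels, the plain `pd.concat` above can produce repeated labels. That is harmless here since the CSV is written with `index=False`, but a sketch of a quick sanity check on toy frames (all values made up) looks like this:

```python
import pandas as pd

snow_toy = pd.DataFrame({"subreddit": ["snowboardingnoobs"] * 2,
                         "text": ["too much overhang?", "lamar whisper?"]})
ski_toy = pd.DataFrame({"subreddit": ["skiing"] * 3,
                        "text": ["wait for it", "wait for it..", "learned to 360 at 53"]})

# ignore_index=True renumbers rows 0..n-1 so labels are unique after stacking.
combo_toy = pd.concat([snow_toy, ski_toy], ignore_index=True)

# Confirm both communities made it into the combined frame.
counts = combo_toy["subreddit"].value_counts()
```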