# Problem Statement and Data Collection #

## Problem Statement ##

The marketing mavens at ABC Toy Co. are dealing with a surprising challenge. It turns out that people who have small children (ABC's target demographic) and people who have dogs tend to talk about their... dependents in similar ways. As a result, ABC has wasted a fortune on internet advertising that ended up targeting people who don't have kids at all. To eliminate this wasteful spending and better target the people most likely to buy their children's products, ABC has commissioned a data scientist to build a machine learning model that can distinguish between those who have actual human children and those whose babies are of the fur variety.

While ABC is concerned about wasteful spending, they still view missing out on reaching potential customers as the greater evil. The ideal model, therefore, will optimize for **sensitivity**, while also taking **accuracy** into consideration.
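To make the two metrics concrete, here is a minimal sketch of how sensitivity (the share of actual r/parenting posts the model catches) and accuracy (the share of all posts labelled correctly) could be computed. The function name and the toy labels are hypothetical, invented only for this illustration, with 1 standing for r/parenting and 0 for r/dogs.

```python
def sensitivity_and_accuracy(y_true, y_pred, positive=1):
    """Return (sensitivity, accuracy) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_neg = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    # Sensitivity: of the actual positives, how many did the model find?
    actual_pos = true_pos + false_neg
    sensitivity = true_pos / actual_pos if actual_pos else 0.0
    # Accuracy: of all posts, how many were labelled correctly?
    accuracy = correct / len(pairs)
    return sensitivity, accuracy


# Toy labels (hypothetical): 1 = r/parenting, 0 = r/dogs
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]

sens, acc = sensitivity_and_accuracy(y_true, y_pred)
print(f"sensitivity={sens:.2f}, accuracy={acc:.2f}")
```

In this toy case the model finds every parent but mislabels two dog owners, which is the kind of trade-off ABC has said it is willing to accept.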

### Methodology ###

#### Data ####

In order to train my models, I'll be looking at posts from two subreddits, one each on children and dogs. Specifically: 
- r/parenting
- r/dogs

I chose these two subreddits because their tone is similar, with posts largely concerning practical issues and advice related to the raising and rearing of children and dogs, respectively.

Using the **pushshift/api**, an API for **Reddit** posts and comments, I will assemble a sample dataset of 4000 total posts, 2000 from each subreddit. My final dataset will consist of these posts after culling duplicates.

The goal of my models will be to correctly identify which of the two subreddits each post comes from. Specifically, I'm interested in correctly identifying posts from the **r/parenting** subreddit, as those posting to this subreddit are potential customers that my client is interested in reaching.

#### Exploratory Data Analysis ####

During EDA I'll examine some of the characteristics of my data, including:
- Post length:
    - How long is the average post?
    - Is there a relationship between post length and which subreddit the post comes from?
- Common Words:
    - After removing some meaningless words, what are some of the most common words used, both overall and by subreddit?
    - How often do words that are obvious differentiators appear?
        - *e.g. how often does the word 'dog' appear in posts in the **r/dogs** subreddit?*
        - I'll be looking to build a custom list of **crutch words** out of these common differentiators, for use in my models
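
As a rough illustration of the common-word count and the crutch-word idea, the sketch below tallies words in a few made-up posts, first with only filler words removed and then with a crutch list removed as well. Both word lists, the helper name `top_words`, and the sample posts are assumptions for this example, not the lists used later in the project.

```python
import re
from collections import Counter

# Hypothetical word lists, chosen only for this illustration
STOP_WORDS = {"the", "a", "my", "to", "is", "and", "on", "at"}
CRUTCH_WORDS = {"dog", "dogs", "puppy", "kid", "kids", "child", "toddler"}


def top_words(posts, exclude=frozenset(), n=5):
    """Count lowercase word tokens across posts, skipping excluded words."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z']+", post.lower())
        counts.update(t for t in tokens if t not in exclude)
    return counts.most_common(n)


# Made-up posts standing in for real submissions
posts = [
    "My dog barks at the door",
    "My toddler cries at the door",
    "The dog and the toddler nap on the couch",
]

print(top_words(posts, exclude=STOP_WORDS))
print(top_words(posts, exclude=STOP_WORDS | CRUTCH_WORDS))
```

With the crutch words removed, only the shared vocabulary is left, which is the harder setting the second set of models will face.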

#### Modelling ####

I'll build four different types of models:

1. Logistic Regression
2. Random Forest
3. Gradient Boost
4. Support Vector Machine

For each model type I'll build two sets of models:

1. The first set will be given the corpus of data with the 'Crutch Words' still included
2. The second will be given data with the 'Crutch Words' removed

For each model I'll use GridSearch to determine the best parameters.
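
The grid search step could look something like the sketch below, which assumes scikit-learn's `GridSearchCV` wrapped around a vectorizer-plus-classifier pipeline and scored on recall (scikit-learn's name for sensitivity). The toy corpus, the step names, and the parameter grid are placeholders for illustration, not the settings used in the actual models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration; 1 = r/parenting, 0 = r/dogs
posts = [
    "my toddler refuses to nap",
    "my puppy refuses to walk on lead",
    "how do I get my kid to eat vegetables",
    "how do I get my dog to stop barking",
    "my daughter is starting school",
    "our dog is starting obedience classes",
    "bedtime routine for a six year old",
    "crate training for a one year old",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Vectorize the text, then classify; step names are arbitrary
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Placeholder grid: every combination is tried with cross-validation
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

# scoring="recall" ranks parameter sets by sensitivity on the positive class
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=2)
search.fit(posts, labels)

print(search.best_params_)
print(search.best_score_)
```

Swapping `LogisticRegression` for another estimator (and adjusting the `clf__` entries in the grid) would give the same search for the other model types.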

#### Model Analysis ####

After I have run all my models, I'll conduct analysis to determine which models performed the best. I'll also take a look at other metrics, such as **feature importances**, to see what further insights I can gain about my data.

#### Production Model, Conclusions and Recommendations ####

As stated above, I'll select my production model based on the model that is best optimized for **sensitivity**, with some consideration given to **accuracy** as well.

I'll then present conclusions, as well as recommendations for moving forward.

## Data Collection ##

In [1]:
import pandas as pd
import requests

In [2]:
base_url = 'https://api.pushshift.io/reddit/search/submission?'
r_dogs = 'subreddit=dogs'
r_parenting = 'subreddit=parenting'
size = 500

In [3]:
res = requests.get(f'{base_url}{r_dogs}&size={size}')
res.status_code
# Utilized https://github.com/pushshift/api and lesson 503 for help with API requests

200

In [4]:
data = res.json()
data.keys()

dict_keys(['data', 'error', 'metadata'])

It looks like I'll have to go one level further into the data to get the actual entries, likely under the 'data' key.

Let's look at the first entry:

In [5]:
data['data'][0]

{'subreddit': 'dogs',
 'selftext': 'We have a 1 year old maltese shihtzu, we have done the bare minimum of meeting on lead, as we have tried to aim for ignoring and being neutral around dogs ..\nI feel we have got to a goood point where he can ignore dogs etc… I’m just wondering how do we go about on lead meetings. \nLet’s Say we have let him meet 8 dogs on lead, there has been 3 times, he has kind of  air snapped towards the other dog.. \nwhat is the reasons behind this? Should I just not let him greet other dogs? How can I help him greet nicely ? \n\nAny advice, information would be appreciated 😀\n\nWe are starting group classes on the weekend to help with socialization and general obedience….',
 'author_fullname': 't2_hpma2uhc',
 'gilded': 0,
 'title': 'On lead greetings',
 'link_flair_richtext': [{'e': 'text', 't': '[Misc Help]'}],
 'subreddit_name_prefixed': 'r/dogs',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': 'help',
 'thumbnail_height': None,
 'top_awarded_type': Non

I have successfully located the data I want. Next I'll convert it to a dataframe:

In [6]:
dogs = pd.DataFrame(data['data'])
dogs.head()

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,crosspost_parent_list,url_overridden_by_dest,crosspost_parent,edited_on,author_cakeday
0,dogs,"We have a 1 year old maltese shihtzu, we have ...",t2_hpma2uhc,0,On lead greetings,"[{'e': 'text', 't': '[Misc Help]'}]",r/dogs,False,6,help,...,1677716189,1677716190,2023-03-02 00:16:11,,,,,,,
1,dogs,"Color, pile thickness, pattern?\n\nI have a mi...",t2_3taem,0,Home carpet recommendations?,"[{'e': 'text', 't': '[Misc Help]'}]",r/dogs,False,6,help,...,1677716137,1677716138,2023-03-02 00:15:20,,,,,,,
2,dogs,"Hello, Noah is a white german shepherd, he has...",t2_3w8o4mpt,0,Dog epilepsy,"[{'e': 'text', 't': '[Health]'}]",r/dogs,False,6,,...,1677716070,1677716070,2023-03-02 00:14:14,,,,,,,
3,dogs,So I’ve moved in to my aunts house and she own...,t2_7xddgy2c,0,Advice needed,"[{'e': 'text', 't': '[Discussion]'}]",r/dogs,False,6,discussion,...,1677715950,1677715951,2023-03-02 00:12:16,,,,,,,
4,dogs,This is an app that could be useful to find a...,t2_15qg9q,0,Mappet is an app for dogs lovers,"[{'e': 'text', 't': '[Breeds] 📝Recommendation'}]",r/dogs,False,6,breeds,...,1677715532,1677715533,2023-03-02 00:05:20,self,{'images': [{'source': {'url': 'https://extern...,,,,,


In [7]:
dogs.shape

(500, 94)

For subsequent requests, I need to ensure that I don't pull duplicate data. The **pushshift/api** has a parameter for this purpose: **'before'** takes a UTC epoch timestamp and returns only entries posted before that time. I can take the earliest **'created_utc'** value from my first request, set it as the **'before'** parameter for my second, and so on. That way, there will be no overlap in the timespan of the posts captured in each request.

In [8]:
utc = dogs['created_utc'].min()

for _ in range(3):
    res = requests.get(f'{base_url}{r_dogs}&size={size}&before={utc}')
    dogs = pd.concat(objs = [dogs, pd.DataFrame(res.json()['data'])])
    utc = dogs['created_utc'].min()

In [9]:
dogs.shape

(2000, 94)

In [10]:
dogs['created_utc'].value_counts()

1676821862    2
1676762853    2
1677716171    1
1677024664    1
1677020592    1
             ..
1677361920    1
1677363606    1
1677363879    1
1677363888    1
1676688620    1
Name: created_utc, Length: 1998, dtype: int64

Now I'll do the same for posts from the **r/parenting** subreddit:

In [11]:
res = requests.get(f'{base_url}{r_parenting}&size={size}')
kids = pd.DataFrame(res.json()['data'])
kids.head()

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,edited_on,author_cakeday
0,Parenting,Has anyone found a way to help their child man...,t2_c3vi6x5j,0,Perfectionism in 6 year old,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,0,,False,1677715022,1677715023,2023-03-01 23:56:49,,,,
1,Parenting,My 8 year old regularly FaceTimes with four fr...,t2_dopol9qi,0,My 8 year old has put me in an uncomfortable p...,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,0,,False,1677715022,1677715023,2023-03-01 23:56:48,,,,
2,Parenting,I love my kid. I’m more than him jumping aroun...,t2_4907o7c,0,Bathtub cracking/falling through ceiling becau...,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,0,,False,1677714979,1677714980,2023-03-01 23:56:03,,,,
3,Parenting,"My 5 yr old never wants to come home, she thro...",t2_4ms79k23,0,I need advice on my 5yr old,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,0,,False,1677714429,1677714430,2023-03-01 23:46:56,,,,
4,Parenting,My toddler is too big for 5t clothes now - mos...,t2_7od5uzkz,0,what comes after 5t - why is sizing so confusing,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,0,,False,1677714001,1677714001,2023-03-01 23:39:47,,,,


In [12]:
utc = kids['created_utc'].min()  # page on created_utc, as with r/dogs, since 'before' filters on post time

for _ in range(3):
    res = requests.get(f'{base_url}{r_parenting}&size={size}&before={utc}')
    kids = pd.concat(objs = [kids, pd.DataFrame(res.json()['data'])])
    utc = kids['created_utc'].min()
    
kids.shape

(2000, 91)
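
Since the same paging loop was written twice, it could be factored into one helper. The sketch below is one possible shape, not the code used to build the dataset: `collect_posts` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that mimics the **'before'** behaviour so the example runs without network access. A real `fetch_page` would wrap the `requests.get` call from the cells above and return `res.json()['data']`.

```python
def collect_posts(fetch_page, pages=4):
    """Gather up to `pages` batches, each strictly older than the last.

    `fetch_page(before)` must return a list of dicts that each carry a
    'created_utc' key; `before` is None for the first request.
    """
    posts = []
    before = None
    for _ in range(pages):
        batch = fetch_page(before)
        if not batch:
            break  # nothing older is available
        posts.extend(batch)
        # The oldest timestamp in this batch bounds the next request
        before = min(post["created_utc"] for post in batch)
    return posts


def fake_fetch(before, size=3, newest=100):
    """Stand-in for the API: return `size` posts older than `before`."""
    start = newest if before is None else before - 1
    return [{"created_utc": t} for t in range(start, max(start - size, 0), -1)]


posts = collect_posts(fake_fetch, pages=4)
timestamps = [post["created_utc"] for post in posts]
print(len(timestamps), len(set(timestamps)))
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP call, so the loop can be checked against a fake source before it is pointed at the live API.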

Now that I have all of my data, I'll combine it into a single dataframe:

In [13]:
dogs_v_kids = pd.concat(objs = [dogs, kids])
dogs_v_kids.shape

(4000, 94)

When I constructed my requests, I used the **'before'** parameter to avoid pulling the same post twice. Items do get reposted, however, so there may still be duplicates. There may even be posts that are cross-posted to both subreddits.

I'll check for duplicates on the **'selftext'** variable. I could use **'title'**, but I'm more concerned about false duplicates there, as two posts with different content could easily share the same title.

In [14]:
dogs_v_kids['selftext'].duplicated().value_counts()

False    3326
True      674
Name: selftext, dtype: int64

It looks like there are a number of duplicates. I really only want unique posts, so I will drop the duplicate posts, keeping one copy of each.

In [18]:
dogs_v_kids.drop_duplicates(subset = 'selftext', keep = 'first', inplace=True)
# Help with .duplicated() and .drop_duplicates() methods from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

In [19]:
dogs_v_kids['subreddit'].value_counts()

dogs         1761
Parenting    1565
Name: subreddit, dtype: int64
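
To show what the duplicate check treats as a duplicate, here is a small sketch on a made-up dataframe (the rows are invented for illustration). One detail worth keeping in mind is that posts with identical placeholder text, such as an empty body, also count as duplicates of each other, even across subreddits.

```python
import pandas as pd

# Made-up rows: two posts share real text, two share an empty body
toy = pd.DataFrame({
    "subreddit": ["dogs", "dogs", "Parenting", "Parenting", "dogs"],
    "selftext": [
        "How do I crate train?",
        "",
        "Bedtime tips?",
        "",
        "How do I crate train?",
    ],
})

# True marks every repeat after the first occurrence of a value
dup_mask = toy["selftext"].duplicated()
print(dup_mask.sum())

# keep='first' retains the earliest row for each distinct selftext
deduped = toy.drop_duplicates(subset="selftext", keep="first")
print(deduped["subreddit"].value_counts())
```

Here the empty Parenting post is dropped because an empty dogs post appeared first, so the order of concatenation can affect which subreddit loses rows.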

After removing duplicates, I still have a good number of posts from each subreddit. I'm satisfied with my data collection, so I'll save my dataframe to a .csv, and move on to data cleaning and EDA:

In [21]:
dogs_v_kids.to_csv('../data/dogs_v_children.csv', index=False)