# Project 3: Web APIs and NLP

#### Brandie Hatch

### Exploratory Data Analysis, Cleaning, and Feature Engineering

## Problem



Learn with Chewie presents:
Web API and NLP Services

What do the users of Reddit consider dog CARE vs. dog TRAINING?

__Data Dictionary__

The features used are listed below:

| **Feature**      | **Type** | **Dataset** | **Description**                                           |
|------------------|----------|-------------|-----------------------------------------------------------|
| **subreddit**    | _object_ | df          | Subreddit Name (instance of Subreddit)                    |
| **title**        | _object_ | df          | Title of submission                                       |
| **selftext**     | _object_ | df          | Selftext of a submission (an empty string if a link post) |
| **author**       | _object_ | df          | Author (Redditor) of the submission                       |
| **name**         | _object_ | df          | Full ID of submission, prefixed with t3_                  |
| **ups**          | _int64_  | df          | Number of up-vote points for a submission                 |
| **downs**        | _int64_  | df          | Number of down-vote points for a submission               |
| **score**        | _int64_  | df          | Total points for a submission                             |
| **num_comments** | _int64_  | df          | Number of comments on the submission                      |

Created with: https://www.tablesgenerator.com/markdown_tables#                                                             

## Imports and Reading In Data

In [4]:
# python library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm

%matplotlib inline
pd.options.display.max_columns = 999

import requests
import time
import re
import nltk



In [5]:
# load data

dogtraining = pd.read_csv('./data/dogtraining.csv')
print(dogtraining.shape)
dogtraining.head()

(5130, 12)


Unnamed: 0,subreddit,id,title,selftext,author,name,ups,downs,score,num_comments,created_utc,over_18
0,Dogtraining,uijir1,Trick of the Month - May 2022 - Crawl Backwards,Welcome to the Trick of the Month!\n\nThis mon...,moo6,t3_uijir1,5,0,5,4,2022-05-04T16:25:25Z,False
1,Dogtraining,ujxbsz,Announcement - Puppy Enrichment AMA With Allie...,,Cursethewind,t3_ujxbsz,6,0,6,2,2022-05-06T14:05:34Z,False
2,Dogtraining,up52vw,How do I get a cafe/brewery dog?,I am sitting at a brewery right now and all th...,slothsandwhich,t3_up52vw,272,0,272,87,2022-05-13T17:11:13Z,False
3,Dogtraining,upd3e8,"Hi, does anybody know the company that makes t...",,Fluffy_Overlord_1995,t3_upd3e8,38,0,38,5,2022-05-14T01:47:21Z,False
4,Dogtraining,upf4dj,My 3 months samoyed forgot all his training af...,"As the title says, after having stomach issues...",osmancode,t3_upf4dj,23,0,23,15,2022-05-14T04:16:59Z,False


The Dog Training data set contains 5,130 observations of 12 variables.

In [7]:
dogtraining.dtypes

subreddit       object
id              object
title           object
selftext        object
author          object
name            object
ups              int64
downs            int64
score            int64
num_comments     int64
created_utc     object
over_18           bool
dtype: object

In [8]:
dogtraining.isnull().sum()

subreddit         0
id                0
title             0
selftext        760
author            0
name              0
ups               0
downs             0
score             0
num_comments      0
created_utc       0
over_18           0
dtype: int64

In [9]:
dogtraining.describe()

Unnamed: 0,ups,downs,score,num_comments
count,5130.0,5130.0,5130.0,5130.0
mean,36.542885,0.0,36.542885,14.703704
std,94.937302,0.0,94.937302,34.123826
min,0.0,0.0,0.0,0.0
25%,1.0,0.0,1.0,1.0
50%,4.0,0.0,4.0,4.0
75%,12.0,0.0,12.0,10.0
max,433.0,0.0,433.0,166.0


In [11]:
dogcare = pd.read_csv('./data/dogcare.csv')
dogcare.head()

Unnamed: 0,subreddit,id,title,selftext,author,name,ups,downs,score,num_comments,created_utc,over_18
0,DogCare,upgugb,Massages for hip dysplasia?,"He's a 6 yo boxer/mastiff,135 lbs and in good ...",Flaky_Watch,t3_upgugb,3,0,3,1,2022-05-14T06:04:21Z,False
1,DogCare,uow11b,"My dog has this weird thing on the tail, can a...",,NivTheGever,t3_uow11b,24,0,24,15,2022-05-13T09:50:09Z,False
2,DogCare,upa4qd,Lab/Great Dane puppy leg shattered,So my dog jumped out of my truck and completel...,Boomstick825,t3_upa4qd,0,0,0,3,2022-05-13T22:14:14Z,False
3,DogCare,up7lse,Any ideas what this might be?,I came home from work a couple of days ago and...,ChunkyMonkey3499,t3_up7lse,0,0,0,1,2022-05-13T19:35:38Z,False
4,DogCare,uoyffw,Anyone know what this could be? 7 Yr old Irish...,,Disastrous_Bobcat402,t3_uoyffw,1,0,1,0,2022-05-13T11:44:38Z,False


In [12]:
dogcare.dtypes

subreddit       object
id              object
title           object
selftext        object
author          object
name            object
ups              int64
downs            int64
score            int64
num_comments     int64
created_utc     object
over_18           bool
dtype: object

In [13]:
dogcare.isnull().sum()

subreddit          0
id                 0
title              0
selftext        3300
author             0
name               0
ups                0
downs              0
score              0
num_comments       0
created_utc        0
over_18            0
dtype: int64

In [14]:
dogcare.describe()

Unnamed: 0,ups,downs,score,num_comments
count,7500.0,7500.0,7500.0,7500.0
mean,6.782,0.0,6.782,8.04
std,7.219849,0.0,7.219849,9.443002
min,0.0,0.0,0.0,0.0
25%,3.0,0.0,3.0,3.0
50%,4.0,0.0,4.0,4.0
75%,7.0,0.0,7.0,10.0
max,26.0,0.0,26.0,42.0


In [None]:
# remove rows with self text nulls??
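
One way to weigh this question, sketched on made-up rows rather than the real data: link posts have no body, so `pd.read_csv` loads their `selftext` as `NaN`. Dropping those rows would discard 760 of the Dogtraining posts and 3,300 of the DogCare posts, titles included, whereas filling with an empty string keeps every title. The `toy` frame and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the real posts; None mimics a link post.
toy = pd.DataFrame({
    "title": ["Crate training tips", "What is this bump?", "Leash pulling"],
    "selftext": ["My puppy cries at night.", None, None],
})

# Option 1: drop rows with no selftext (this also loses the titles of link posts).
dropped = toy.dropna(subset=["selftext"])

# Option 2: keep every row and treat missing selftext as an empty string.
filled = toy.assign(selftext=toy["selftext"].fillna(""))

print(len(dropped), len(filled))
print(filled["selftext"].str.len().tolist())
```

With the second option, length and word-count columns can be computed for every post without special-casing missing values.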

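In [None]:
# keep only posts not flagged over_18, then drop columns not needed for analysis
dogtraining = dogtraining[~dogtraining['over_18']].drop(columns=['id', 'created_utc', 'over_18'])
dogcare = dogcare[~dogcare['over_18']].drop(columns=['id', 'created_utc', 'over_18'])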

In [None]:
# join the two DataFrames
df = pd.concat([dogtraining, dogcare], ignore_index=True)
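
A minimal sketch of what stacking the two frames looks like, using hypothetical miniature frames (`training` and `care` are illustrative names, not the real data). It shows why resetting the index is worth considering: each CSV carries its own 0..n-1 labels, so a plain concatenation repeats them.

```python
import pandas as pd

# Hypothetical miniature versions of the two subreddit frames.
training = pd.DataFrame({"subreddit": ["Dogtraining"] * 2, "title": ["Sit", "Stay"]})
care = pd.DataFrame({"subreddit": ["DogCare"] * 3, "title": ["Itchy ears", "Limp", "Diet"]})

# Without ignore_index, each frame keeps its own 0..n-1 labels, so labels repeat.
stacked = pd.concat([training, care])

# With ignore_index=True, the combined frame gets a fresh 0..n-1 index.
combined = pd.concat([training, care], ignore_index=True)

print(stacked.index.is_unique, combined.index.is_unique)
print(combined["subreddit"].value_counts())
```

The `value_counts()` call is a quick check that both subreddits made it into the combined frame and how many posts each contributes.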

## Explore Data

In [None]:
# look at ups, downs, num_comments in comparison to the subreddit
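
One possible way to make this comparison is a grouped summary. The sketch below uses a hypothetical `toy` frame with the same column names. In the `describe()` outputs above, `downs` is 0 for every post in both subreddits, so `score` carries the same information as `ups`.

```python
import pandas as pd

# Hypothetical rows with the same column names as the real data.
toy = pd.DataFrame({
    "subreddit": ["Dogtraining", "Dogtraining", "DogCare", "DogCare"],
    "ups": [272, 4, 3, 24],
    "downs": [0, 0, 0, 0],
    "num_comments": [87, 2, 1, 15],
})

# Mean, median, and max of each engagement metric, one row per subreddit.
summary = (
    toy.groupby("subreddit")[["ups", "downs", "num_comments"]]
    .agg(["mean", "median", "max"])
)
print(summary)
```

Reporting the median alongside the mean matters here because vote counts are heavily skewed: a few very popular posts can pull the mean far above the typical post.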


### Lengths of Titles and Selftext descriptions

To decide whether the lengths of Titles and Selftext descriptions merit further review, create new columns holding the character and word counts of each.

In [None]:
# create a new column called title_length that contains the length of each title

df['title_length'] = df['title'].transform(len)
df.head()

In [None]:
# create a new column called title_word_count that contains the number of words in each title

df['title_word_count'] = df['title'].map(lambda x: len(x.split()))
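
How a title is split affects its word count. The sketch below compares splitting on a single space with splitting on any run of whitespace. The `titles` series is hypothetical.

```python
import pandas as pd

# Hypothetical titles: repeated spaces, leading/trailing spaces, and an empty string.
titles = pd.Series(["Loose  leash walking", " Recall help ", ""])

# Splitting on a single space keeps empty strings between repeated spaces.
by_space = titles.map(lambda x: len(x.split(" ")))

# Splitting on any whitespace run ignores repeated, leading, and trailing spaces.
by_whitespace = titles.map(lambda x: len(x.split()))

print(by_space.tolist())       # [4, 4, 1]
print(by_whitespace.tolist())  # [3, 2, 0]
```

Splitting with no argument gives the count most readers would expect, including 0 for an empty string.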

In [None]:
# create a new column called selftext_length that contains the length of each selftext
# (link posts load as NaN, so treat missing selftext as an empty string)

df['selftext_length'] = df['selftext'].fillna('').transform(len)
df.head()

In [None]:
# create a new column called selftext_word_count that contains the number of words in each selftext

df['selftext_word_count'] = df['selftext'].fillna('').map(lambda x: len(x.split()))

### Longest and shortest Titles and Selftext descriptions

To decide whether further review is warranted, look at the five shortest and five longest Titles and Selftext descriptions by word count.

In [None]:
df.sort_values(by='title_word_count')['title'].head(5)


In [None]:
df.sort_values(by='selftext_word_count')['selftext'].head(5)

In [None]:
df.sort_values(by='title_word_count', ascending=False)['title'].head(5)

In [None]:
df.sort_values(by='selftext_word_count', ascending=False)['selftext'].head(5)

### Distribution of lengths of Titles and Selftext descriptions

In [None]:
df['title_length'].hist()
plt.title('Distribution of Titles by Character Length')
plt.xlabel('Character Count')
plt.ylabel('Title Count');

In [None]:
df['title_word_count'].hist()
plt.title('Distribution of Titles by Word Count')
plt.xlabel('Word Count')
plt.ylabel('Title Count');
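
Since the question is how the two subreddits differ, it may help to overlay one histogram per subreddit rather than pooling all titles. This is a sketch only: the `toy` frame is hypothetical (its four titles are copied from the `head()` outputs above), and the bin edges are an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frame built from four titles shown in the head() outputs.
toy = pd.DataFrame({
    "subreddit": ["Dogtraining", "Dogtraining", "DogCare", "DogCare"],
    "title": [
        "How do I get a cafe/brewery dog?",
        "Trick of the Month - May 2022 - Crawl Backwards",
        "Massages for hip dysplasia?",
        "Any ideas what this might be?",
    ],
})
toy["title_word_count"] = toy["title"].str.split().str.len()

# One semi-transparent histogram per subreddit on shared axes and shared bins.
fig, ax = plt.subplots()
for name, group in toy.groupby("subreddit"):
    ax.hist(group["title_word_count"], bins=range(0, 13, 2), alpha=0.5, label=name)
ax.set_title("Distribution of Titles by Word Count, per Subreddit")
ax.set_xlabel("Word Count")
ax.set_ylabel("Title Count")
ax.legend()
```

Using the same bins for both groups keeps the bars directly comparable. With unequal group sizes, passing `density=True` to `ax.hist` is one option for comparing shapes rather than raw counts.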

### EDA Conclusions and Notes