# Data Cleaning and Initial EDA Notebook

## Problem Statement

The goal of this project is to gather data from two different subreddits ([*r/pregnant*](https://www.reddit.com/r/pregnant/) & [*r/parenting*](https://www.reddit.com/r/Parenting/)) and then use Natural Language Processing (NLP) to train a classifier that can determine which subreddit a given post comes from. I chose r/parenting and r/pregnant for this comparison because my partner and I are currently expecting our first child, and I was interested in the differences in language between individuals who are seeking support and/or posting about their pregnancy and those who are doing the same as parents. To examine these differences I will build two classification models, a Random Forest and a logistic regression. A successful project will be one where the classification model accurately predicts which subreddit a post comes from based on its text. I believe the results of this project may give first-time parents an idea of the changes that occur when transitioning from pregnancy to becoming a parent.

## Library Imports and Reading in Data

In [1]:
#Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#NLTK or Natural Language Toolkit imports (based on imports from lesson 5.03)
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

pd.set_option('display.max_colwidth', 999)

In [2]:
#Initial CSVs from the scrape; uncomment to see the initial raw data, or check the data folder
#preg_df = pd.read_csv('./data/Raw_data/pregnant_submissions.csv').drop(columns= 'Unnamed: 0')
#parent_df = pd.read_csv('./data/Raw_data/parent_submissions.csv').drop(columns= 'Unnamed: 0')

## Data Investigation

In [3]:
#comparison of unique posts from each subreddit
#based on the raw data, using the dataframes in the cell above
#print(f"Out of {preg_df.shape[0]} rows from r/pregnant, {preg_df['selftext'].nunique()} are unique!")
#print(f"Out of {parent_df.shape[0]} rows from r/parenting, {parent_df['selftext'].nunique()} are unique!")

In [4]:
#CSVs with lems after tokenizing and lemmatizing
preg_df = pd.read_csv('./data/Lem_data/pregnant_lems.csv').drop(columns= 'Unnamed: 0')
parent_df = pd.read_csv('./data/Lem_data/parent_lems.csv').drop(columns= 'Unnamed: 0')

#### Investigating Authors with the Highest Number of Posts for r/pregnant and r/parenting

In [5]:
#uncomment individual lines to run each
#investigating Authors on r/pregnant
#preg_df['author'].nunique() # = 35_417
#top_10posters = preg_df['author'].value_counts()[:11]
#preg_df['author'].value_counts()[0:101] #top 100 authors have made between 34 and 100 posts on r/pregnant
#preg_df['author'].value_counts()[:1500] #top 1500 posters have made 10 or more posts

In [6]:
#uncomment individual lines to run each
#parent_df['author'].nunique() #= 51150
#parent_df['author'].value_counts()[0:11] #top ten authors made 50-76 posts
#parent_df['author'].value_counts()[:51] #top 50 authors made 23-76 posts
#parent_df['author'].value_counts()[:101] #top 100 authors made 17-76 posts
#parent_df['author'].value_counts()[0:386] #top 385 authors have made over 10 posts

### Brief description of the data scraped from each subreddit:
* [r/pregnant](https://www.reddit.com/r/pregnant/) :
    * Time Frame from Saturday, May 27, 2017 7:44:08 AM MST to Saturday, October 30, 2021 6:53:17 PM MST
    * Out of 99,978 posts scraped, 94,671 were unique posts.
    * 35,419 unique authors, meaning that on average each author made 2.67 posts in this time frame.
        * However, the actual distribution of posts is fairly top-heavy, with the top 1,500 authors each making 10 or more posts.

* [r/parenting](https://www.reddit.com/r/parenting/) :
    * Time Frame from Sunday, February 10, 2019 2:57:49 PM MST to Saturday, October 30, 2021 6:45:37 PM MST
    * Out of 99,988 posts scraped, 79,174 were unique posts.
    * 51,150 unique authors, meaning that on average each author made 1.55 posts in this time frame.
        * However, the actual distribution of posts is largely top-heavy, with the top 385 authors each making 10 or more posts.
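
The per-author figures above can be reproduced with a short helper. The sketch below is illustrative only: `toy_df` and `author_summary` are hypothetical names, and the toy data stands in for the scraped frames, which are assumed to have an `author` column.

```python
import pandas as pd

# Toy stand-in for the scraped data (hypothetical values).
toy_df = pd.DataFrame({
    'author': ['ann', 'ann', 'ann', 'bob', 'cat', 'cat'],
    'selftext': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
})

def author_summary(df, min_posts=10):
    """Return (unique authors, mean posts per author, authors with >= min_posts posts)."""
    counts = df['author'].value_counts()      # posts per author, most active first
    n_authors = counts.size                   # number of distinct authors
    mean_posts = round(len(df) / n_authors, 2)
    n_frequent = int((counts >= min_posts).sum())
    return n_authors, mean_posts, n_frequent

# min_posts=2 only because the toy data is tiny; the notebook uses 10.
n_authors, mean_posts, n_frequent = author_summary(toy_df, min_posts=2)
print(n_authors, mean_posts, n_frequent)
```

Comparing the mean with the count of frequent posters is a quick way to see how top-heavy a subreddit is without scanning `value_counts()` slices by hand.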


## EDA/Data Cleaning

### 1. Duplicates and Rearranging Columns

In [7]:
#from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html & 
#https://pandas.pydata.org/docs/reference/api/pandas.Series.drop_duplicates.html#pandas.Series.drop_duplicates
# 5306 duplicates for r/pregnant
preg_df.drop_duplicates(subset= ['selftext'], inplace= True)

In [8]:
#20_814 duplicates in r/parenting
parent_df.drop_duplicates(subset= ['selftext'], inplace= True)
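
The duplicate counts quoted in the comments can be checked before dropping anything. This is a minimal sketch on made-up data (`toy_df` is hypothetical), assuming duplicates are defined by the `selftext` column alone, as in the cells above.

```python
import pandas as pd

# Toy frame with two repeated posts (hypothetical values).
toy_df = pd.DataFrame({
    'selftext': ['first post', 'second post', 'first post', 'third post', 'second post'],
    'title': ['a', 'b', 'c', 'd', 'e'],
})

# duplicated() flags every repeat after the first occurrence (keep='first' is the default).
n_dupes = int(toy_df.duplicated(subset=['selftext']).sum())

# drop_duplicates() removes exactly those flagged rows.
deduped = toy_df.drop_duplicates(subset=['selftext'])

print(f'{n_dupes} duplicates removed, {len(deduped)} rows kept')
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison; either approach removes the same rows.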

In [9]:
#the created_utc column is no longer needed; it existed solely to ensure that the web scrape loop kept pulling new data
#the lines below were used to drop it from the raw data
#preg_df.drop(columns=['created_utc'], inplace= True)
#parent_df.drop(columns=['created_utc'], inplace= True)

### 2. Nulls and Data Check

In [13]:
preg_df.isna().sum()

lems          14
selftext       0
title          0
title_lems    15
author         0
subreddit      0
dtype: int64

In [14]:
parent_df.isna().sum()

lems          3
selftext      0
title         0
title_lems    8
author        0
subreddit     0
dtype: int64

In [15]:
#with very few nulls in each df we can simply drop them; this is unlikely to have a large impact given the amount of data scraped
preg_df.dropna(inplace= True)
parent_df.dropna(inplace=True)
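
As a sketch of the same step with a row count attached, the example below drops nulls and resets the index in one pass. `toy_df` is hypothetical, and restricting `dropna` to the two lemma columns is an assumption based on the null counts shown above. Passing `drop=True` to `reset_index` avoids creating the extra `index` column that otherwise has to be dropped later.

```python
import numpy as np
import pandas as pd

# Toy frame with nulls only in the lemma columns (hypothetical values).
toy_df = pd.DataFrame({
    'lems': ['a b', np.nan, 'c d', 'e'],
    'title_lems': ['t1', 't2', np.nan, 't4'],
    'selftext': ['a b', '', 'c d', 'e'],
})

before = len(toy_df)

# Drop rows missing either lemma column, then renumber rows from 0.
cleaned = toy_df.dropna(subset=['lems', 'title_lems']).reset_index(drop=True)

rows_dropped = before - len(cleaned)
print(f'dropped {rows_dropped} of {before} rows')
```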

In [20]:
#Doublecheck that nulls dropped
#preg_df.isna().sum()

In [21]:
#parent_df.isna().sum()

In [23]:
#previously removed from the raw data
#only 1 'removed' that was in r/pregnant
#preg_df[preg_df['selftext'] == '[removed]']

In [24]:
#preg_df.drop(139, axis= 0, inplace= True)

In [26]:
#previously removed from the raw data
#parent_df[parent_df['selftext'] == '[removed]']

In [27]:
#parent_df.drop(0, axis=0, inplace= True)

In [29]:
#resetting index after the drops
#parent_df.reset_index(inplace=True)
#preg_df.reset_index(inplace= True)

### 3. Lemmatizing Reddit Posts from data files

All of the cells below have been commented out as they were used to tokenize and lemmatize the initial raw data scrape from r/pregnant and r/parenting.

In [30]:
#using Lemmatizer and Tokenizer to tokenize and lem the data
#tokenizer = RegexpTokenizer(r'\w+')
#lemmatizer = WordNetLemmatizer()

In [31]:
#commented out
#Adapted from the lemma function in John Hazard's reddit project, used as an example
#(https://github.com/JDHazard/web_scraping_reddit_classification_modeling/blob/master/notebooks/data_cleaning_and_eda.ipynb)
#lemmatizer Function
#def lemma(text):
#    tokens = tokenizer.tokenize(text.lower())
#    lems   = [lemmatizer.lemmatize(i) for i in tokens]
    
#    text = ' '.join(lems)
#    return text

In [32]:
#lemma(parent_df['selftext'])
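
Note that `lemma` expects a single string, so it has to be applied one post at a time rather than to a whole column. The sketch below shows that pattern with `Series.apply`. To stay runnable without downloading NLTK corpora it swaps in stand-ins: `re.findall(r'\w+', ...)` in place of `RegexpTokenizer(r'\w+')`, and a hypothetical lookup table `TOY_LEMMAS` in place of `WordNetLemmatizer`. It is not the function used to build the lemma CSVs.

```python
import re
import pandas as pd

# Hypothetical stand-in for WordNetLemmatizer: maps a few plural forms only.
TOY_LEMMAS = {'babies': 'baby', 'weeks': 'week', 'kids': 'kid'}

def toy_lemma(text):
    """Lowercase the text, split it into word tokens, map each token, and rejoin."""
    tokens = re.findall(r'\w+', text.lower())   # word-character runs, punctuation dropped
    return ' '.join(TOY_LEMMAS.get(tok, tok) for tok in tokens)

posts = pd.Series(['Two Babies, 12 weeks apart!', 'My kids love naps.'])

# apply() calls toy_lemma once per post, like the list comprehensions in the cells below.
lems = posts.apply(toy_lemma)
print(lems.tolist())
```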

In [33]:
#Lemmatizing selftext for r/pregnant
#preg_df['lems'] = [lemma(i) for i in preg_df['selftext']]
#preg_df.head(3)

In [34]:
#Lemmatizing selftext for r/parenting
#parent_df['lems'] = [lemma(i) for i in parent_df['selftext']]
#parent_df.head(3)

In [36]:
#Lemmatizing title for r/pregnant
#preg_df['title_lems'] = [lemma(i) for i in preg_df['title']]
#preg_df.head(3)

In [37]:
#Lemmatizing title for r/parenting
#parent_df['title_lems'] = [lemma(i) for i in parent_df['title']]
#parent_df.head(3)

### 4. Second Data Check and Merging of r/parenting and r/pregnant Dataframes

Some of the cells below have been commented out, as they were only needed while processing the initial raw data scrape from r/pregnant and r/parenting.

In [43]:
#initially had a 'level_0' column to drop as well
parent_df.drop(columns=['index'], inplace= True)

#### Rearranging Columns After Addition of Lems

In [47]:
#rearrange columns process from https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns
cols = list(parent_df.columns)
cols

['lems', 'selftext', 'title', 'title_lems', 'author', 'subreddit']

In [48]:
cols = [cols[-2], cols[0], cols[3], cols[-1], cols[1], cols[2]]
cols

['author', 'lems', 'title_lems', 'subreddit', 'selftext', 'title']

In [53]:
#rearrange columns
preg_df = preg_df[cols]
parent_df = parent_df[cols]
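
An alternative to positional indexing is to spell out the target order by name, which keeps working if the incoming column order changes. This is a minimal sketch on a one-row toy frame (`toy_df` and `new_order` are hypothetical names); the column names match the output shown above.

```python
import pandas as pd

# One-row toy frame with the same columns as the lemmatized data (hypothetical values).
toy_df = pd.DataFrame({
    'lems': ['a'], 'selftext': ['A'], 'title': ['T'],
    'title_lems': ['t'], 'author': ['ann'], 'subreddit': ['pregnant'],
})

# Target order written out explicitly instead of built from positions.
new_order = ['author', 'lems', 'title_lems', 'subreddit', 'selftext', 'title']

# Guard against typos or a silently dropped column before reordering.
assert set(new_order) == set(toy_df.columns)

reordered = toy_df[new_order]
print(list(reordered.columns))
```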

In [50]:
#sending dfs with lems to csv
#preg_df.to_csv('./data/pregnant_lems.csv')
#parent_df.to_csv('./data/parent_lems.csv')

The next steps take place in the Notebook2_Preproccessing notebook.