# <img src="../images/vegan-logo-resized.png" style="float: right; margin: 10px;">

# Data Cleaning and Exploratory Data Analysis

Author: Gifford Tompkins

---

Project 03 | Notebook 1 of 6

## OBJECTIVE
This notebook establishes a baseline model against which we will compare our final model's performance. We then clean the data and prepare it for analysis. Finally, we begin some exploratory data analysis to get a sense of whether our body of data can answer our problem statement and, if so, how to develop a strategy for building our model.

# Import Libraries and Dataset

In [1]:
import pandas as pd
import numpy as np
import time
import regex as re

from nltk.stem import WordNetLemmatizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [2]:
df = pd.read_csv('../datasets/api_data.csv')
df = df.drop_duplicates()

## Drop Duplicates
Before establishing a baseline model, we remove any exact duplicates from our data set (the `drop_duplicates` call above) so that the class distribution is valid.

In [3]:
df['vegan'].value_counts(normalize=True)

0    0.502602
1    0.497398
Name: vegan, dtype: float64

# Baseline Model

Our baseline model always predicts the majority class of our data set. We will attempt to build a model that is more accurate than this naive guess.

We have a class distribution of 

|Vegan|Vegetarian|
|---|---|
|49.7%|50.3%|

So our baseline score is that of the majority class, **Vegetarians**, at **50.26%**.

In [4]:
base_score = 1 - df['vegan'].mean()
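
As a minimal sketch of the same idea, the snippet below uses a small made-up label series (not the project data) to show that the majority-class baseline is simply the largest normalized value count. The names `labels`, `class_share`, `baseline_accuracy` and `majority_class` are illustrative only.

```python
import pandas as pd

# Toy labels for illustration only: 1 = vegan, 0 = vegetarian.
labels = pd.Series([0, 0, 0, 1, 1])

# Share of each class in the data.
class_share = labels.value_counts(normalize=True)

# Accuracy obtained by always guessing the most common class.
baseline_accuracy = class_share.max()
majority_class = class_share.idxmax()

print(majority_class, baseline_accuracy)
```

Note that `1 - df['vegan'].mean()` gives the same number only because class `0` is the majority here; taking the maximum share works whichever class is larger.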

# Data Cleaning
We will look through our data and see if anything needs to be cleaned.

In [5]:
df.head(10)

Unnamed: 0,title,selftext,vegan
0,Lentil soup with sliced Beyond Meat sausage ta...,,0
1,Lentil soup with sliced Beyond Meat sausage ta...,,0
2,Recipes for thanksgiving?,Hi! My bf’s sister is hosting thanksgiving thi...,0
3,What Happens IMMEDIATELY After You Die?? The T...,[https://www.youtube.com/watch?v=Dyvl\_8SoAjc...,0
4,Vegetarian Chili for a cold October evening,,0
5,Any suggestions for vegetarian meal delivery s...,[removed],0
6,Vegetarian =/= healthy. Garbage plate!,,0
7,Lazy chickpea pot pie!,,0
8,Ask me if I miss eating meat.,,0
9,Would you fill out this survey?,[removed],0


In [6]:
df.isnull().mean()

title       0.000000
selftext    0.568235
vegan       0.000000
dtype: float64

## Missing Values
The `'selftext'` column has many `null` values as well as several instances of the string `'[removed]'`. This is how the API records that a post contained body text that was later removed (by the user, the subreddit moderators, or Reddit moderators).

We will address both issues by replacing these values with the empty string, the textual equivalent of a `null` value.

Fortunately, `'title'` and `'vegan'` have no missing values.

In [7]:
mask_removed = df['selftext']=='[removed]'
df[mask_removed].groupby(by='vegan')['selftext'].count()

vegan
0    520
1    258
Name: selftext, dtype: int64

### Create `'removed'` column
Vegetarian posts (`vegan == 0`) have roughly twice as many `'[removed]'` bodies as vegan posts (520 versus 258). This difference might carry signal, so I am going to save that information in a new column called `'removed'`. I will then remove the `'[removed]'` string from the `'selftext'` column.

In [8]:
df['removed'] = (mask_removed).astype(int)

In [9]:
# Confirm that the new column was created correctly.
df.groupby(by='vegan')['removed'].sum()

vegan
0    520
1    258
Name: removed, dtype: int64

In [10]:
df[mask_removed].head()

Unnamed: 0,title,selftext,vegan,removed
5,Any suggestions for vegetarian meal delivery s...,[removed],0,1
9,Would you fill out this survey?,[removed],0,1
31,Reddit hates vegetarians,[removed],0,1
34,Vegetarian in an omnivore family?,[removed],0,1
48,Supplements for new vegetarian?,[removed],0,1


In [11]:
# Replace all '[removed]' values with null values
df['selftext'] = df['selftext'].where(~mask_removed,np.nan)

In [12]:
# Confirm that values have been replaced.
df[mask_removed]

Unnamed: 0,title,selftext,vegan,removed
5,Any suggestions for vegetarian meal delivery s...,,0,1
9,Would you fill out this survey?,,0,1
31,Reddit hates vegetarians,,0,1
34,Vegetarian in an omnivore family?,,0,1
48,Supplements for new vegetarian?,,0,1
...,...,...,...,...
20729,Let’s Convince Bumble To Add A Vegan Filter,,1,1
20903,How to be vegan with numerous food intolerances?,,1,1
20906,"If you eat meat, you only care about the envir...",,1,1
20943,herbal cigarettes?,,1,1


### Imputing Empty Strings
Now, for the null values in the `selftext` column, we are going to impute empty strings.

In [13]:
df['selftext'] = df['selftext'].fillna('')

In [14]:
df['selftext'].isnull().sum()

0
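
The three steps above (flag, blank out, impute) can be sketched end to end on a toy frame. The data below is made up for illustration, and `posts` and `was_removed` are hypothetical names rather than variables from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are made up.
posts = pd.DataFrame({
    "selftext": ["[removed]", np.nan, "Some body text", "[removed]"],
    "vegan": [0, 0, 1, 1],
})

# 1. Record which posts had their body removed before the marker is erased.
was_removed = posts["selftext"] == "[removed]"
posts["removed"] = was_removed.astype(int)

# 2. Turn the '[removed]' marker into a missing value.
posts["selftext"] = posts["selftext"].where(~was_removed, np.nan)

# 3. Replace every missing value with the empty string.
posts["selftext"] = posts["selftext"].fillna("")

print(posts)
```

The order matters: the flag has to be built before step 2, because afterwards a removed body is indistinguishable from a post that never had one.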

In [15]:
df.head()

Unnamed: 0,title,selftext,vegan,removed
0,Lentil soup with sliced Beyond Meat sausage ta...,,0,0
1,Lentil soup with sliced Beyond Meat sausage ta...,,0,0
2,Recipes for thanksgiving?,Hi! My bf’s sister is hosting thanksgiving thi...,0,0
3,What Happens IMMEDIATELY After You Die?? The T...,[https://www.youtube.com/watch?v=Dyvl\_8SoAjc...,0,0
4,Vegetarian Chili for a cold October evening,,0,0


# Create `'text'` Column
To simplify the vectorization, we will create a column with all of our textual data.

In [16]:
df['text'] = df['title'] + ' ' + df['selftext']
df.head()

Unnamed: 0,title,selftext,vegan,removed,text
0,Lentil soup with sliced Beyond Meat sausage ta...,,0,0,Lentil soup with sliced Beyond Meat sausage ta...
1,Lentil soup with sliced Beyond Meat sausage ta...,,0,0,Lentil soup with sliced Beyond Meat sausage ta...
2,Recipes for thanksgiving?,Hi! My bf’s sister is hosting thanksgiving thi...,0,0,Recipes for thanksgiving? Hi! My bf’s sister i...
3,What Happens IMMEDIATELY After You Die?? The T...,[https://www.youtube.com/watch?v=Dyvl\_8SoAjc...,0,0,What Happens IMMEDIATELY After You Die?? The T...
4,Vegetarian Chili for a cold October evening,,0,0,Vegetarian Chili for a cold October evening
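
A short sketch of why the empty-string imputation has to happen before this concatenation: in pandas, adding a string to a missing value yields a missing value, so an unfilled `selftext` would wipe out the title as well. The frame below is toy data, and `unfilled` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
posts = pd.DataFrame({
    "title": ["Lazy chickpea pot pie!", "Recipes for thanksgiving?"],
    "selftext": [np.nan, "Looking for ideas."],
})

# Concatenating with a missing value propagates the missing value.
unfilled = posts["title"] + " " + posts["selftext"]

# After imputing empty strings, every row keeps at least its title.
filled = posts["title"] + " " + posts["selftext"].fillna("")

print(unfilled.isnull().tolist())
print(filled.tolist())
```

Rows without body text end up with a trailing space after the title, which later whitespace normalization can remove.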


## Lemmatize and Standardize Text Column
For our next piece of cleaning, we will use the custom function called `clean_string` that will strip any HTML-formatting elements from our string and then pass that string through a WordNetLemmatizer. 

The lemmatizer will reduce our vocabulary by converting words to their basic forms. 
- For example: "recipes" and "recipe" will both be converted to "recipe" and counted as the same vocabulary word.

We may lose some signal by doing this, but it will help our analysis in the long run. When we convert this data into its final form for analysis, every word or phrase will be considered a feature. Thus, if we can cut down the number of features, we will cut down the amount of time and processing power necessary to fit and evaluate our models.

To see the code and documentation for this function, see the [`data_cleaning_and_eda`](./project_functions/data_cleaning_and_eda.py) module stored in the [`project_functions`](./project_functions/) folder in this repository.

In [17]:
from project_functions.data_cleaning_and_eda import clean_string
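
The real `clean_string` lives in the repository and is not reproduced here. The block below is only a hypothetical stand-in that illustrates the kind of steps described above. It uses the standard library's `html.parser` in place of BeautifulSoup and takes the lemmatizer as an optional argument (defaulting to no lemmatization), so that it runs without NLTK or its WordNet data. The names `clean_string_sketch` and `_TextExtractor` are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_string_sketch(text, lemmatize=lambda word: word):
    """Hypothetical stand-in for clean_string; not the project's code."""
    # Strip tags; character references such as &amp; are decoded by the parser.
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()
    plain = " ".join(extractor.parts)

    # Lowercase and keep ASCII letters and whitespace only.
    letters_only = re.sub(r"[^a-z\s]", "", plain.lower())

    # Lemmatize word by word and collapse repeated whitespace.
    return " ".join(lemmatize(word) for word in letters_only.split())


first = clean_string_sketch("Lentil soup tastes DELICIOUS.")
second = clean_string_sketch("Lentil soup tastes DELICIOUS \U0001F60B")
print(first == second)
```

To add lemmatization, one could pass `WordNetLemmatizer().lemmatize` as the `lemmatize` argument, assuming NLTK and its WordNet corpus are installed.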

## More Duplicates and the Reason for Beautiful Soup
Notice the first two titles of our dataset. They were not removed by our initial duplicate drop and are considered unique because of a few non-textual elements.

In [18]:
print(df.loc[0,'title'])
print(df.loc[1,'title'])

Lentil soup with sliced Beyond Meat sausage tastes DELICIOUS.
Lentil soup with sliced Beyond Meat sausage tastes DELICIOUS 😋


In [19]:
# Python does not consider these two strings as identical.
df['text'][0] == df['text'][1]

False

>When we clean the strings, they will be converted into a form that is identified as identical. This will ultimately help us reduce the amount of _noise_ in our model.

In [20]:
# Check that the cleaned versions of our strings would be read as identical.
df['text'].map(clean_string)[0] == df['text'].map(clean_string)[1]

True

> When we pass the strings through `clean_string`, they are interpreted as identical. We can use this to further refine our data, drop more duplicated rows, and reduce more noise.
>
> We will pass the `keep='last'` parameter to the `drop_duplicates` method so as to keep the (closest to) original post. We will also drop duplicates only within the same subreddit. Finally, we will drop any posts that, after cleaning, are only the empty string.

In [21]:
df['text'] = df['text'].map(clean_string)

In [22]:
print(df.shape)
df = df.drop_duplicates(subset=['text','vegan'],keep='last')
df = df[df['text'].str.strip()!='']
print(df.shape)
# Dropped: 174 rows

(19411, 5)
(19237, 5)
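
To make the effect of `subset` and `keep='last'` concrete, here is a small sketch on made-up data. The frame `posts` and its values are illustrative only and assume the `'text'` column has already been cleaned.

```python
import pandas as pd

# Toy frame for illustration only; 'text' is assumed to be already cleaned.
posts = pd.DataFrame({
    "text": ["lentil soup", "lentil soup", "lentil soup", "   ", "vegan chili"],
    "vegan": [0, 0, 1, 1, 1],
})

# Rows 0 and 1 share both text and label, so only the later one (row 1) survives.
# Row 2 has the same text but a different label, so it is kept.
deduped = posts.drop_duplicates(subset=["text", "vegan"], keep="last")

# Drop rows whose text is empty or whitespace only.
deduped = deduped[deduped["text"].str.strip() != ""]

print(deduped.index.tolist())
```

Including `'vegan'` in `subset` is what restricts the deduplication to posts from the same subreddit.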


In [23]:
# Confirm that one of our first duplicates has been resolved.
df.head(5)

Unnamed: 0,title,selftext,vegan,removed,text
1,Lentil soup with sliced Beyond Meat sausage ta...,,0,0,lentil soup with sliced beyond meat sausage ta...
2,Recipes for thanksgiving?,Hi! My bf’s sister is hosting thanksgiving thi...,0,0,recipe for thanksgiving hi my bfs sister is ...
3,What Happens IMMEDIATELY After You Die?? The T...,[https://www.youtube.com/watch?v=Dyvl\_8SoAjc...,0,0,what happens immediately after you die the t...
4,Vegetarian Chili for a cold October evening,,0,0,vegetarian chili for a cold october evening
5,Any suggestions for vegetarian meal delivery s...,,0,1,any suggestion for vegetarian meal delivery se...


# Save Cleaned Data Frame
We will save our cleaned data frame and use this in our subsequent exploration.

In [24]:
data_csv = '../datasets/data.csv'
df.to_csv(data_csv,index=False)
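
One thing worth keeping in mind when this file is read back: by default, `pd.read_csv` treats empty fields as missing, so the empty strings imputed into `'selftext'` will come back as `NaN`. The sketch below demonstrates this on a toy frame, using an in-memory buffer rather than the project's file, and shows `keep_default_na=False` as one possible way to preserve empty strings.

```python
import io

import pandas as pd

# Toy frame for illustration only.
posts = pd.DataFrame({"title": ["Lazy chickpea pot pie!"], "selftext": [""]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
posts.to_csv(buffer, index=False)

# Default parsing treats the empty field as missing.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False leaves the empty field as an empty string.
buffer.seek(0)
strict_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["selftext"].isnull().tolist())
print(strict_read["selftext"].tolist())
```

Calling `fillna('')` again after loading is an equally simple alternative.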

# Summary and Next Steps
In this notebook, we cleaned our data to prepare it for exploratory analysis. We removed duplicates and empty strings and converted our data from HTML formatting to plain text using BeautifulSoup and some custom functions. We also combined all of our textual data into a single `'text'` column and saved the result to [`data.csv`](../datasets/data.csv).

In our next notebook, we will perform some exploratory analysis before building our first models.