# <img src="../images/vegan-logo-resized.png" style="float: right; margin: 10px;">

# Data Cleaning and Exploratory Data Analysis

Author: Gifford Tompkins

---

Project 03 | Notebook 1 of 6

## OBJECTIVE
This notebook establishes a baseline model against which we will compare our final model's success. We then clean the data to make it ready for analysis and begin some Exploratory Data Analysis to get a sense of whether we will be able to answer our problem statement given our body of data. If so, that exploration will also help us develop a strategy for building our model.

# Import Libraries and Dataset

In [20]:
import pandas as pd
import numpy as np
import time
import regex as re

from nltk.stem import WordNetLemmatizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [21]:
df = pd.read_csv('../data/api_data.csv')
df = df.drop_duplicates()

## Drop Duplicates
Before establishing a baseline model, we will remove any duplicates from our data set to establish a valid class distribution.

In [22]:
df['vegan'].value_counts(normalize=True)

0    0.507898
1    0.492102
Name: vegan, dtype: float64

# Baseline Model

Our baseline model always predicts the majority class of our data set. We will attempt to create a model that is more accurate than this naive guess.

We have a class distribution of 

|Vegan|Vegetarian|
|---|---|
|49.2%|50.8%|

So our baseline score is that of the majority class, **Vegetarians**, at **50.79%**.

In [23]:
base_score = 1 - df['vegan'].mean()
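The expression `1 - df['vegan'].mean()` gives the majority share only because class `0` happens to be the larger class. As a minimal sketch on a made-up frame (the `toy` data and the variable names are illustrative, not part of this project), the same score can be computed without assuming which class is larger:

```python
import pandas as pd

# Hypothetical toy labels: 1 = vegan post, 0 = vegetarian post.
toy = pd.DataFrame({'vegan': [0, 0, 0, 1, 1]})

# Share of each class; the largest share is the accuracy of always
# guessing the majority class.
class_share = toy['vegan'].value_counts(normalize=True)
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(majority_class, baseline_accuracy)
```

Any model we build later should beat this number on held-out data to be worth keeping.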

# Data Cleaning
We will look through our data and see if anything needs to be cleaned.

In [24]:
df.head(10)

Unnamed: 0,title,selftext,vegan
0,"Due to lack of options at the store right now,...",,0
1,Fettuccine Alfredo with mushrooms and broccoli...,,0
2,Savoury semolina crepe with veggies and schezw...,,0
3,Good ol avocado toast,,0
4,Hummus,,0
5,Avocado pasta,,0
6,Garlic chilli lemon mushroom penne,,0
7,"Homemade rice bowl with a sunny up egg, roaste...",,0
8,Red curry fried rice with baked tofu,,0
11,Easy No Yeast Beer Bread (with vegan alternati...,,0


In [25]:
df.isnull().mean()

title       0.000000
selftext    0.581108
vegan       0.000000
dtype: float64

## Missing Values
The `'selftext'` column has many `null` values (about 58% of rows) as well as several instances of the phrase `'[removed]'`. This is how the API records that a post contained body text that was later removed (by the user, the subreddit, or reddit moderators).

We will address both of these issues by replacing them with the empty string, the textual equivalent of a `null` value.

Fortunately, `'title'` and `'vegan'` have no missing values.

In [26]:
mask_removed = df['selftext']=='[removed]'
df[mask_removed].groupby(by='vegan')['selftext'].count()

vegan
0    692
1    307
Name: selftext, dtype: int64

### Create `'removed'` column
`Vegetarians` has more than twice as many `'[removed]'` posts as `Vegans` (692 vs. 307). This difference might carry signal, so I am going to save that information in a new column called `'removed'`. I will then remove the `'[removed]'` string from the `'selftext'` column.

In [27]:
df['removed'] = (mask_removed).astype(int)

In [28]:
# Confirm that the new column was created correctly.
df.groupby(by='vegan')['removed'].sum()

vegan
0    692
1    307
Name: removed, dtype: int64
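Raw counts compare fairly here because the two classes are nearly the same size. With less balanced classes, the per-class rate would be the safer comparison. A minimal sketch on a made-up frame (the values are illustrative, not taken from our data):

```python
import pandas as pd

# Hypothetical toy data: four posts per class.
toy = pd.DataFrame({
    'vegan':   [0, 0, 0, 0, 1, 1, 1, 1],
    'removed': [1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 flag is the fraction of posts in each class that were removed.
removed_rate = toy.groupby('vegan')['removed'].mean()
print(removed_rate)
```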

In [29]:
df[mask_removed].head()

Unnamed: 0,title,selftext,vegan,removed
25,Homemade frozen veggie burgers turned mushy,[removed],0,1
28,New to vegetarianism and just found out about ...,[removed],0,1
129,relationship help,[removed],0,1
155,Cooking mama wannabe,[removed],0,1
161,Start With You,[removed],0,1


In [30]:
# Replace all '[removed]' values with null values
df['selftext'] = df['selftext'].where(~mask_removed,np.nan)

In [31]:
# Confirm that values have been replaced.
df[mask_removed]

Unnamed: 0,title,selftext,vegan,removed
25,Homemade frozen veggie burgers turned mushy,,0,1
28,New to vegetarianism and just found out about ...,,0,1
129,relationship help,,0,1
155,Cooking mama wannabe,,0,1
161,Start With You,,0,1
...,...,...,...,...
19868,Mmmm meat is so tasty I love it,,1,1
19872,Depression Meals,,1,1
19937,that is veggie patty,,1,1
19961,Dont kill the animals,,1,1


### Imputing Empty Strings
Now, for the null values in the `selftext` column, we are going to impute empty strings.

In [32]:
df['selftext'] = df['selftext'].fillna('')

In [33]:
df['selftext'].isnull().sum()

0
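The flag, blank-out, and fill steps above can be condensed. The sketch below runs on a made-up frame and uses `Series.replace` for the marker instead of `where`; it is one possible shortening, not the code this notebook ran:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering the three cases: real text, removed, null.
toy = pd.DataFrame({
    'selftext': ['Full recipe inside.', '[removed]', np.nan],
    'vegan': [1, 0, 0],
})

# Flag removed posts first, so the information survives the overwrite.
toy['removed'] = (toy['selftext'] == '[removed]').astype(int)

# Treat the marker and true nulls the same way: both become the empty string.
toy['selftext'] = toy['selftext'].replace('[removed]', '').fillna('')

print(toy)
```

The order matters: creating the flag after the replacement would leave it all zeros.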

In [34]:
df.head()

Unnamed: 0,title,selftext,vegan,removed
0,"Due to lack of options at the store right now,...",,0,0
1,Fettuccine Alfredo with mushrooms and broccoli...,,0,0
2,Savoury semolina crepe with veggies and schezw...,,0,0
3,Good ol avocado toast,,0,0
4,Hummus,,0,0


# Create `'text'` Column
To simplify the vectorization, we will create a column with all of our textual data.

In [35]:
df['text'] = df['title'] + ' ' + df['selftext']
df.head()

Unnamed: 0,title,selftext,vegan,removed,text
0,"Due to lack of options at the store right now,...",,0,0,"Due to lack of options at the store right now,..."
1,Fettuccine Alfredo with mushrooms and broccoli...,,0,0,Fettuccine Alfredo with mushrooms and broccoli...
2,Savoury semolina crepe with veggies and schezw...,,0,0,Savoury semolina crepe with veggies and schezw...
3,Good ol avocado toast,,0,0,Good ol avocado toast
4,Hummus,,0,0,Hummus
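Filling the nulls before this step matters because string concatenation in pandas propagates missing values. A minimal sketch on a made-up frame (names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one post without body text, one with.
toy = pd.DataFrame({
    'title': ['Hummus', 'Avocado pasta'],
    'selftext': [np.nan, 'Recipe in comments'],
})

# Without filling, a null body makes the whole combined text null.
unfilled = toy['title'] + ' ' + toy['selftext']

# After filling, every row keeps at least its title.
filled = toy['title'] + ' ' + toy['selftext'].fillna('')

print(unfilled.isnull().sum(), filled.isnull().sum())
```

Had we skipped the imputation, more than half of our posts would have lost their titles at this step.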


## Lemmatize and Standardize Text Column
For our next piece of cleaning, we will use the custom function called `clean_string` that will strip any HTML-formatting elements from our string and then pass that string through a WordNetLemmatizer. 

The lemmatizer will reduce our vocabulary by converting words to their basic forms. 
- For example: "ran" and "run" will both be converted to "run" and counted as the same vocabulary word.

We may lose some signal by doing this, but it will help our analysis in the long run. When we convert this data into its final form for analysis, every word or phrase will be considered a feature. Thus, if we can cut down the number of features, we will cut down the amount of time and processing power necessary to fit and evaluate our models.

To see the code and documentation for this function, see the [`data_cleaning`](./project_functions/data_cleaning_and_eda.py) code stored in the [`project_functions`](./project_functions/) folder in this repository.
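The real `clean_string` lives in the project folder and is not reproduced here. To illustrate the idea only, the sketch below is a simplified stand-in that runs on the standard library alone: a regular expression replaces BeautifulSoup for tag stripping, and a tiny hand-written lookup table (`TOY_LEMMAS`, hypothetical) replaces the WordNetLemmatizer. The function name `clean_string_sketch` is also an assumption.

```python
import html
import re

# Hypothetical stand-in for a lemmatizer: a tiny lookup table, NOT WordNet.
TOY_LEMMAS = {'options': 'option', 'mushrooms': 'mushroom', 'veggies': 'veggie'}


def clean_string_sketch(text):
    """Strip markup, lowercase, drop punctuation, and map words to a base form."""
    text = html.unescape(text)                        # '&amp;' -> '&'
    text = re.sub(r'<[^>]+>', ' ', text)              # drop HTML tags
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())  # keep letters, digits, whitespace
    return ' '.join(TOY_LEMMAS.get(word, word) for word in text.split())


# Two strings that differ only in markup, case, and punctuation.
a = clean_string_sketch('<b>Mushrooms</b> &amp; Veggies!')
b = clean_string_sketch('mushrooms & veggies')
print(a)
print(b)
```

Both inputs reduce to the same string, which is the property we rely on when dropping duplicates after cleaning.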

In [40]:
from project_functions.data_cleaning_and_eda import clean_string

## More Duplicates and the Reason for Beautiful Soup
Some posts survive our initial duplicate drop and are considered unique only because of a few non-textual elements, such as HTML formatting. As a quick check, we compare the first two titles of our dataset before and after cleaning.

In [41]:
print(df.loc[0,'title'])
print(df.loc[1,'title'])

Due to lack of options at the store right now, tried the cracked pepper slices from Tofurkey, a brand I don’t usually like too much, but these were actually pretty yummy.
Fettuccine Alfredo with mushrooms and broccolini and sourdough toast.


In [42]:
# Python does not consider these two strings as identical.
df['text'][0] == df['text'][1]

False

> Cleaning converts strings that differ only in formatting into an identical form. This will ultimately help us reduce the amount of _noise_ in our model.

In [43]:
# Compare the cleaned versions of the two strings (cleaning only the two rows we need).
clean_string(df.loc[0, 'text']) == clean_string(df.loc[1, 'text'])

False

> In the run shown here, the first two posts are different posts, so the comparison returns `False` both before and after cleaning. For posts that differ only in formatting, `clean_string` produces identical strings. We can use this to further refine our data, drop more duplicated rows and reduce more noise.
>
> We will pass the `keep='last'` parameter into the `drop_duplicates` method so as to keep the (closest to) original post, and we will only drop duplicates within the same subreddit. Finally, we will drop any posts that, after the cleaning, are only the empty string.

In [44]:
df['text'] = df['text'].map(clean_string)

In [46]:
print(df.shape)
r_0 = df.shape[0]
df = df.drop_duplicates(subset=['text','vegan'],keep='last')
df = df[df['text'].str.strip()!='']
print(df.shape)
print(f"Dropped {r_0 - df.shape[0]} row(s).")

(18987, 5)
(18987, 5)
Dropped 0 row(s).
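To make the behavior of `subset` and `keep='last'` concrete, here is a minimal sketch on a made-up frame (the rows are illustrative, not from our data):

```python
import pandas as pd

# Hypothetical toy data: a repeated post within one class, the same text
# in the other class, and a whitespace-only post.
toy = pd.DataFrame({
    'text':  ['hummus', 'hummus', 'hummus', '   ', 'avocado pasta'],
    'vegan': [0, 0, 1, 1, 0],
})

rows_before = toy.shape[0]

# Same text in the same subreddit counts as a duplicate; keep the last occurrence.
deduped = toy.drop_duplicates(subset=['text', 'vegan'], keep='last')

# Drop rows whose text is empty once whitespace is stripped.
deduped = deduped[deduped['text'].str.strip() != '']

print(deduped.index.tolist())
print(f"Dropped {rows_before - deduped.shape[0]} row(s).")
```

Row 0 is dropped as a duplicate of row 1, row 2 survives because it belongs to the other class, and row 3 is dropped as empty.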


In [47]:
# Inspect the cleaned text column for the first few rows.
df.head(5)

Unnamed: 0,title,selftext,vegan,removed,text
0,"Due to lack of options at the store right now,...",,0,0,due to lack of option at the store right now ...
1,Fettuccine Alfredo with mushrooms and broccoli...,,0,0,fettuccine alfredo with mushroom and broccolin...
2,Savoury semolina crepe with veggies and schezw...,,0,0,savoury semolina crepe with veggie and schezwa...
3,Good ol avocado toast,,0,0,good ol avocado toast
4,Hummus,,0,0,hummus


# Save Cleaned Data Frame
We will save our cleaned data frame and use this in our subsequent exploration.

In [48]:
data_csv = '../data/data.csv'
df.to_csv(data_csv,index=False)

# Summary and Next Steps
In this notebook, we cleaned our data to prepare it for exploratory analysis. We removed duplicates and empty strings, and we standardized our data from HTML formatting to plain text using BeautifulSoup and some custom functions. We also combined all of our textual data into a single corpus and saved the result to [`data.csv`](../data/data.csv).

In our next notebook, we will perform some exploratory analysis before building our first models.