# <img src="../images/vegan-logo-resized.png" style="float: right; margin: 10px;">

# Data Cleaning and Exploratory Data Analysis

Author: Gifford Tompkins

---

Project 03 | Notebook 1 of 6

## OBJECTIVE
This notebook establishes a baseline model against which to compare our final model. We then clean the data and prepare it for analysis. Finally, we begin some exploratory data analysis to get a sense of whether our body of data can answer our problem statement and, if so, how to develop a strategy for building our model.

# Import Libraries and Dataset

In [1]:
import pandas as pd
import numpy as np
import time
import regex as re

from nltk.stem import WordNetLemmatizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

from project_functions.data_cleaning_and_eda import clean_string

In [2]:
df = pd.read_csv('../data/corpus.csv')
df = df.drop_duplicates()

## Drop Duplicates
Before establishing a baseline model, we will remove any duplicates from our data set to establish a valid class distribution.

In [3]:
df['vegan'].value_counts(normalize=True)

0    0.501411
1    0.498589
Name: vegan, dtype: float64

# Baseline Model

Our baseline model always predicts the majority class of our data set. Any model we build should be more accurate than this naive guess.

We have a class distribution of 

|Vegan|Vegetarian|
|---|---|
|49.9%|50.1%|

So our baseline score is the share of the majority class, **Vegetarians**, at **50.14%**.

In [4]:
base_score = 1 - df['vegan'].mean()

# Data Cleaning
We will look through our data and see if anything needs to be cleaned.

In [5]:
df.head(10)

Unnamed: 0,title,selftext,vegan
0,My ‘100 calories club’ which helps people visu...,,0
1,Chilli fritters stuffed with a mix of raw onio...,,0
2,Lentil and mushroom gravy with sweetpotato mas...,,0
3,How to get excited about vegetables?,I stopped eating meat after 24 years due to a ...,0
4,Homemade Rasgulla - Bengali Spongy Milk Sweets...,,0
5,Homemade Rasgulla - Bengali Spongy Milk Sweets...,,0
6,Supporting a vegetarian diet!,,0
7,Leek and mint puree with oyster mushroom 'scal...,,0
8,What's for Dinner? Discussion,Welcome to our weekly discussion on what you’r...,0
9,Lumpin beans! 36% of protein,,0


In [6]:
df.isnull().mean()

title       0.000000
selftext    0.582756
vegan       0.000000
dtype: float64

## Missing Values
The `'selftext'` column has many `null` values as well as several instances of the string `'[removed]'`. This is how the API records that a post contained body text that was later removed (by the user, the subreddit moderators, or the Reddit moderators).

We will address both of these issues by replacing those values with the empty string, the textual equivalent of a `null` value.

Fortunately, `'title'` and `'vegan'` have no missing values.

In [7]:
mask_removed = df['selftext']=='[removed]'
df[mask_removed].groupby(by='vegan')['selftext'].count()

vegan
0    140
1    133
Name: selftext, dtype: int64

### Create `'removed'` column
Vegetarian and vegan posts have a similar number of `'[removed]'` bodies (140 and 133, respectively). This fact might still carry some signal, so I am going to save that information in a new column called `'removed'`. I will then remove the `'[removed]'` string from the `'selftext'` column.

In [8]:
df['removed'] = (mask_removed).astype(int)

In [9]:
# Confirm that the new column was created correctly.
df.groupby(by='vegan')['removed'].sum()

vegan
0    140
1    133
Name: removed, dtype: int64

In [10]:
df[mask_removed].head()

Unnamed: 0,title,selftext,vegan,removed
18,What are some brands of prepackaged stuff that...,[removed],0,1
23,Anyone has suggestion on good veggie sausages?...,[removed],0,1
50,"I’m Veg, my BF adores meat. What can we cook f...",[removed],0,1
75,"Ridge Gourd Gravy | Gravy For Dosa, Idli, Poor...",[removed],0,1
92,I'm going backpacking with some non vegetarian...,[removed],0,1


In [11]:
# Replace all '[removed]' values with null values
df['selftext'] = df['selftext'].where(~mask_removed,np.nan)

In [12]:
# Confirm that values have been replaced.
df[mask_removed]

Unnamed: 0,title,selftext,vegan,removed
18,What are some brands of prepackaged stuff that...,,0,1
23,Anyone has suggestion on good veggie sausages?...,,0,1
50,"I’m Veg, my BF adores meat. What can we cook f...",,0,1
75,"Ridge Gourd Gravy | Gravy For Dosa, Idli, Poor...",,0,1
92,I'm going backpacking with some non vegetarian...,,0,1
...,...,...,...,...
4032,ญี่ปุ่น ยันไม่ประกาศภาวะฉุกเฉินอีก แม้โควิดพุ่...,,1,1
4041,"What I think when people say ""plants feel pain...",,1,1
4050,นักเสี่ยงโชคลุ้นเลขเด็ดจากขันน้ำมนต์ในศาลเก่าแ...,,1,1
4051,"ดึงดูดทุกสายตา! ""ซังอา"" ตัวแม่โยคะหน้านิ่งสุดเ...",,1,1


### Imputing Empty Strings
Now, for the null values in the `selftext` column, we are going to impute empty strings.

In [13]:
df['selftext'] = df['selftext'].fillna('')

In [14]:
df['selftext'].isnull().sum()

0
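
The missing-value steps above can be condensed into a small sketch. The `toy` frame below is made up for illustration, and `Series.mask` is used here as an equivalent alternative to the `where(~mask, np.nan)` call in the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: real body text, a removed body, and a missing body.
toy = pd.DataFrame({"selftext": ["some body text", "[removed]", np.nan]})

# Comparing NaN to a string is False, so only true '[removed]' rows are flagged.
is_removed = toy["selftext"] == "[removed]"
toy["removed"] = is_removed.astype(int)

# Blank out the flagged rows, then fill every missing value with ''.
toy["selftext"] = toy["selftext"].mask(is_removed).fillna("")

print(toy)
```

Creating the flag before overwriting the text matters: once `'[removed]'` has been replaced, those rows can no longer be told apart from posts that never had a body.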

In [15]:
df.head()

Unnamed: 0,title,selftext,vegan,removed
0,My ‘100 calories club’ which helps people visu...,,0,0
1,Chilli fritters stuffed with a mix of raw onio...,,0,0
2,Lentil and mushroom gravy with sweetpotato mas...,,0,0
3,How to get excited about vegetables?,I stopped eating meat after 24 years due to a ...,0,0
4,Homemade Rasgulla - Bengali Spongy Milk Sweets...,,0,0


# Create `'text'` Column
To simplify the vectorization, we will create a column with all of our textual data.

In [16]:
df['text'] = df['title'] + ' ' + df['selftext']
df.head()

Unnamed: 0,title,selftext,vegan,removed,text
0,My ‘100 calories club’ which helps people visu...,,0,0,My ‘100 calories club’ which helps people visu...
1,Chilli fritters stuffed with a mix of raw onio...,,0,0,Chilli fritters stuffed with a mix of raw onio...
2,Lentil and mushroom gravy with sweetpotato mas...,,0,0,Lentil and mushroom gravy with sweetpotato mas...
3,How to get excited about vegetables?,I stopped eating meat after 24 years due to a ...,0,0,How to get excited about vegetables? I stopped...
4,Homemade Rasgulla - Bengali Spongy Milk Sweets...,,0,0,Homemade Rasgulla - Bengali Spongy Milk Sweets...
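
The order of operations matters here: concatenating a string with a missing value yields a missing value, which is why the empty strings were imputed first. A minimal sketch on a made-up two-row frame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["Post with body", "Post without body"],
    "selftext": ["Body text", np.nan],
})

# Without imputation, a missing body makes the whole combined value missing.
naive = toy["title"] + " " + toy["selftext"]

# Filling with '' first preserves the title for posts that have no body.
combined = toy["title"] + " " + toy["selftext"].fillna("")

print(naive.isna().tolist())
print(combined.tolist())
```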


## Lemmatize and Standardize Text Column
For our next piece of cleaning, we will use a custom function called `clean_string`, which strips any HTML formatting elements from a string and then passes that string through a `WordNetLemmatizer`.

The lemmatizer will reduce our vocabulary by converting words to their basic forms. 
- For example: "ran" and "run" will both be converted to "run" and counted as the same vocabulary word.

We may lose some signal by doing this, but it will help our analysis in the long run. When we convert this data into its final form for analysis, every word or phrase will be considered a feature. Thus, if we can cut down the number of features, we will cut down the amount of time and processing power necessary to fit and evaluate our models.
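
To make the feature-count argument concrete, here is a toy sketch. The `TOY_LEMMAS` lookup and `toy_lemmatize` helper are hypothetical stand-ins for a real lemmatizer, not the code inside `clean_string`. Note that NLTK's `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument is given, so whether "ran" actually becomes "run" depends on how it is called.

```python
# Hypothetical lemma lookup standing in for a real lemmatizer.
TOY_LEMMAS = {"ran": "run", "running": "run", "recipes": "recipe"}


def toy_lemmatize(token):
    """Return the toy lemma for a token, or the token itself if unknown."""
    return TOY_LEMMAS.get(token, token)


tokens = "i ran while she was running to find recipes for a recipe".split()

# Each distinct token would become its own feature after vectorization.
raw_vocab = set(tokens)
lemma_vocab = {toy_lemmatize(t) for t in tokens}

print(len(raw_vocab), len(lemma_vocab))
```

Even in this tiny sentence, collapsing inflected forms removes two features; across a full corpus the reduction would be correspondingly larger.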

To see the code and documentation for this function, see the [`data_cleaning_and_eda`](./project_functions/data_cleaning_and_eda.py) module stored in the [`project_functions`](./project_functions/) folder in this repository.

## More Duplicates and the Reason for Beautiful Soup
Notice the two titles below. They were not removed by our initial duplicate drop because a few non-textual elements make them count as unique.

In [32]:
print(df.loc[241,'title'])
print(df.loc[242,'title'])

Handmade Pasta in a Wild Chanterelle Alfredo Sauce
Handmade Pasta in a Wild Chanterelle 🍄Alfredo Sauce


In [34]:
# Python does not consider these two strings as identical.
df['text'][241] == df['text'][242]

False

>When we clean the strings, they will be converted into a form that is identified as identical. This will ultimately help us reduce the amount of _noise_ in our model.
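
As a rough illustration of why normalization makes these two titles match, the sketch below uses a hypothetical `toy_normalize` helper built on the standard `re` module. It is a simplified stand-in, not the actual `clean_string` implementation, and it assumes that dropping every character that is not a letter, digit, or whitespace is acceptable.

```python
import re


def toy_normalize(text):
    """Hypothetical stand-in for clean_string: drop symbols, lowercase, squeeze spaces."""
    # Replace anything that is not a word character or whitespace (emoji included).
    text = re.sub(r"[^\w\s]", " ", text)
    # Lowercase and collapse runs of whitespace into single spaces.
    return " ".join(text.lower().split())


a = "Handmade Pasta in a Wild Chanterelle Alfredo Sauce"
b = "Handmade Pasta in a Wild Chanterelle 🍄Alfredo Sauce"

print(a == b)
print(toy_normalize(a) == toy_normalize(b))
```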

In [29]:
# Check that the cleaned versions of our strings would be read as identical.
df['text'].map(clean_string)[241] == df['text'].map(clean_string)[242]

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

> When we pass the strings through `clean_string`, they are interpreted as identical. We can use this to further refine our data, drop more duplicated rows and reduce more noise.
>
> We will pass the `keep='last'` parameter into the `drop_duplicates` method so as to keep the (closest to) original post. We will also only drop duplicates from within the same subreddit. Finally, we will drop any posts that, after the cleaning, are only the empty string.

In [None]:
df['text'] = df['text'].map(clean_string)

In [None]:
print(df.shape)
df = df.drop_duplicates(subset=['text','vegan'],keep='last')
df = df[df['text'].str.strip()!='']
print(df.shape)
# Dropped: 174 rows
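
The effect of `subset` and `keep='last'` is easiest to see on a small made-up frame. The `toy` data below is hypothetical; it only demonstrates that identical text is dropped within a class but kept across classes, and that empty rows are filtered afterwards.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["pasta alfredo", "pasta alfredo", "pasta alfredo", "", "tofu scramble"],
    "vegan": [0, 0, 1, 1, 1],
})

# Rows 0 and 1 match on both columns, so only the last of them (row 1) survives.
# Row 2 has the same text but a different class, so it is kept.
deduped = toy.drop_duplicates(subset=["text", "vegan"], keep="last")

# Drop rows whose text is empty after cleaning (row 3).
deduped = deduped[deduped["text"].str.strip() != ""]

print(deduped.index.tolist())
```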

In [None]:
# Confirm that one of our first duplicates has been resolved.
df.head(5)

# Save Cleaned Data Frame
We will save our cleaned data frame and use this in our subsequent exploration.

In [None]:
data_csv = '../datasets/data.csv'
df.to_csv(data_csv,index=False)

# Summary and Next Steps
In this notebook, we cleaned our data to prepare it for some exploratory analysis. We removed duplicates and empty strings and standardized our data from HTML formatting to plain text using BeautifulSoup and some custom functions. We also combined all of our textual data into a single corpus and saved the result to [`data.csv`](../datasets/data.csv).

In our next notebook, we will perform some exploratory analysis before building our first models.