# Classifying Recipes: Data Cleaning
The goal of this project is to classify recipe quality based on each recipe's rating and ingredients. The [dataset](https://www.kaggle.com/hugodarwood/epirecipes) was taken from Kaggle and is based on recipes from [Epicurious](https://www.epicurious.com/recipes-menus). I will use a support vector machine classifier to make predictions.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists
import warnings
warnings.filterwarnings('ignore')

import helpers as hp
from config import usr, pwd, url, port, db, table

%matplotlib inline

## Load Dataset

In [2]:
df = pd.read_csv('./data/epi_r.csv', encoding='latin')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20052 entries, 0 to 20051
Columns: 680 entries, title to turkey
dtypes: float64(679), object(1)
memory usage: 104.0+ MB


## Data Cleaning
Copy the `DataFrame` to a new variable to preserve the initial, raw dataset.

In [4]:
df_clean = df.copy()

In [5]:
df_clean.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| rating | 20052.0 | 3.714467 | 1.340829 | 0.0 | 3.75 | 4.375 | 4.375 | 5.0 |
| calories | 15935.0 | 6322.958017 | 359046.041242 | 0.0 | 198.00 | 331.000 | 586.000 | 30111218.0 |
| protein | 15890.0 | 100.160793 | 3840.318527 | 0.0 | 3.00 | 8.000 | 27.000 | 236489.0 |
| fat | 15869.0 | 346.877497 | 20456.106859 | 0.0 | 7.00 | 17.000 | 33.000 | 1722763.0 |
| sodium | 15933.0 | 6225.974895 | 333318.188891 | 0.0 | 80.00 | 294.000 | 711.000 | 27675110.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| cookbooks | 20052.0 | 0.000150 | 0.012231 | 0.0 | 0.00 | 0.000 | 0.000 | 1.0 |
| leftovers | 20052.0 | 0.000349 | 0.018681 | 0.0 | 0.00 | 0.000 | 0.000 | 1.0 |
| snack | 20052.0 | 0.001396 | 0.037343 | 0.0 | 0.00 | 0.000 | 0.000 | 1.0 |
| snack week | 20052.0 | 0.000948 | 0.030768 | 0.0 | 0.00 | 0.000 | 0.000 | 1.0 |


In [6]:
df_described = df_clean.describe().T

### Separate Dataset by Types
#### Nutrition
Search for nutrition information by finding those columns with a max value greater than 5.

In [7]:
nutrition_list = df_described[df_described['max'] > 5].index.tolist()

#### Keywords
Search for keywords by finding those columns with a max value less than or equal to 1.

In [8]:
keyword_list = df_described[df_described['max'] <= 1].index.tolist()
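
To make the threshold-based split concrete, here is a minimal sketch on a small made-up frame. The column names mirror the real dataset, but the values and the `toy_*` names are invented for illustration. It relies on `describe()` summarizing only numeric columns by default, which is why a text column such as `title` never shows up in either list.

```python
import pandas as pd

# Made-up toy data: names mirror the real dataset, values are invented.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "rating": [0.0, 4.375, 5.0],
    "calories": [198.0, 331.0, 586.0],
    "sodium": [80.0, 294.0, 711.0],
    "snack": [0.0, 1.0, 0.0],
    "leftovers": [0.0, 0.0, 1.0],
})

# describe() covers numeric columns only, so 'title' is absent from the summary.
toy_described = toy.describe().T

# Same thresholds as in the cells above.
toy_nutrition = toy_described[toy_described["max"] > 5].index.tolist()
toy_keywords = toy_described[toy_described["max"] <= 1].index.tolist()

# Everything that falls between the thresholds, plus non-numeric columns.
toy_remaining = [c for c in toy.columns if c not in toy_nutrition + toy_keywords]

print(toy_nutrition)
print(toy_keywords)
print(toy_remaining)
```

A column whose maximum lies between 1 and 5, such as `rating`, lands in neither list, so it has to be accounted for separately.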

#### Remaining Columns
The columns that are neither keywords nor nutrition information are _title_ and _rating_.

In [9]:
remaining_columns = ['title', 'rating']

In [10]:
print(len(nutrition_list) + len(keyword_list) + len(remaining_columns))
print(len(df_clean.columns))

680
680


Good! The subset lists together account for all 680 columns in the dataset. Let's examine missing values for each subset separately.

### Missing Values
#### Nutrition Information

In [11]:
hp.find_na_columns(df_clean.loc[:, nutrition_list], display_fractions=True)

Variables with missing values and their fraction of missing values:
calories    0.205316
protein     0.207560
fat         0.208608
sodium      0.205416
dtype: float64
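
The `helpers` module is not shown in this notebook, so the exact implementation of `hp.find_na_columns` is unknown. The following is only a hypothetical stand-in (named `find_na_columns_sketch`) that produces similar output, assuming the helper reports the per-column share of missing values and omits columns without any.

```python
import pandas as pd


def find_na_columns_sketch(frame: pd.DataFrame, display_fractions: bool = False) -> pd.Series:
    """Hypothetical stand-in for hp.find_na_columns; not the author's code."""
    na_counts = frame.isna().sum()
    # Report either fractions or raw counts, keeping only columns with gaps.
    result = na_counts / len(frame) if display_fractions else na_counts
    result = result[na_counts > 0]
    label = "fraction" if display_fractions else "number"
    print(f"Variables with missing values and their {label} of missing values:")
    print(result)
    return result


# Made-up data: half of the 'calories' values are missing, 'snack' is complete.
toy_na = pd.DataFrame({
    "calories": [100.0, None, 250.0, None],
    "snack": [0.0, 1.0, 0.0, 0.0],
})
na_fractions = find_na_columns_sketch(toy_na, display_fractions=True)
```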


In [12]:
df_clean.loc[:, nutrition_list].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| calories | 15935.0 | 6322.958017 | 359046.041242 | 0.0 | 198.0 | 331.0 | 586.0 | 30111218.0 |
| protein | 15890.0 | 100.160793 | 3840.318527 | 0.0 | 3.0 | 8.0 | 27.0 | 236489.0 |
| fat | 15869.0 | 346.877497 | 20456.106859 | 0.0 | 7.0 | 17.0 | 33.0 | 1722763.0 |
| sodium | 15933.0 | 6225.974895 | 333318.188891 | 0.0 | 80.0 | 294.0 | 711.0 | 27675110.0 |


**Observations:** Each nutrition column is missing roughly 20% of its values. The minimum value in every nutrition column is 0, so -1 never occurs naturally. To handle missing values, I will set them to the sentinel value -1 and decide how to treat them during the modeling phase.

In [13]:
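# Assign the result back: calling fillna(inplace=True) on a column selection
# may not modify df_clean in newer pandas versions.
df_clean[nutrition_list] = df_clean[nutrition_list].fillna(-1)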

In [14]:
df_clean.loc[:, nutrition_list].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| calories | 20052.0 | 5024.547127 | 320079.870055 | -1.0 | 68.0 | 257.0 | 502.0 | 30111218.0 |
| protein | 20052.0 | 79.163824 | 3418.840071 | -1.0 | 0.75 | 5.0 | 19.0 | 236489.0 |
| fat | 20052.0 | 274.3076 | 18198.230525 | -1.0 | 0.0 | 12.0 | 28.0 | 1722763.0 |
| sodium | 20052.0 | 4946.855127 | 297126.722129 | -1.0 | 6.0 | 164.0 | 571.0 | 27675110.0 |


In [15]:
hp.find_na_columns(df_clean.loc[:, nutrition_list], display_fractions=True)

Variables with missing values and their fraction of missing values:
Series([], dtype: float64)
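
Because -1 now stands in for missing nutrition values, summary statistics computed on these columns include the sentinel. As a rough sketch of one option for the modeling phase (an assumption, not a decision made in this notebook), the sentinel can be recorded in an indicator column and then masked back to `NaN` before computing statistics. The frame and the `*_missing` column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data in which -1 marks an originally missing value.
toy_nutrients = pd.DataFrame({
    "calories": [198.0, -1.0, 586.0],
    "fat": [7.0, 17.0, -1.0],
})

# Keep a 0/1 flag per column so the "was missing" information survives.
for column in ["calories", "fat"]:
    toy_nutrients[f"{column}_missing"] = (toy_nutrients[column] == -1).astype(int)

# Mask the sentinel so that statistics skip it instead of averaging it in.
masked = toy_nutrients[["calories", "fat"]].replace(-1, np.nan)
print(masked.median())
```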


#### Keywords

In [16]:
hp.find_na_columns(df_clean.loc[:, keyword_list], display_fractions=True)

Variables with missing values and their fraction of missing values:
Series([], dtype: float64)


**Observations:** There are no missing values for the keywords.

#### Remaining Columns

In [17]:
hp.find_na_columns(df_clean.loc[:, remaining_columns], display_fractions=True)

Variables with missing values and their fraction of missing values:
Series([], dtype: float64)


**Observations:** There are no missing values for the remaining columns.

## Save Dataset
Save cleaned dataset to a SQL database.

In [18]:
# Create the database if it doesn't exist
db_url = f"postgresql+psycopg2://{usr}:{pwd}@{url}:{port}/{db}"
if not database_exists(db_url):
    create_database(db_url)

In [19]:
# Reuse the connection URL built in the previous cell
engine = create_engine(db_url)

In [20]:
df_clean.to_sql(name=table, con=engine, index=False, if_exists='replace')

In [21]:
engine.dispose()
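
A quick way to gain confidence in the save step is to read the table back and compare it with the frame that was written. The sketch below illustrates that round trip under a stated assumption: it uses an in-memory SQLite connection from the standard library instead of the PostgreSQL engine above, so it runs without credentials. The table name `toy_recipes` and the data are made up.

```python
import sqlite3

import pandas as pd

# Made-up stand-in for df_clean, including a column name with a space.
toy_clean = pd.DataFrame({
    "title": ["a", "b"],
    "rating": [4.375, 5.0],
    "snack week": [0.0, 1.0],
})

# In-memory SQLite database (assumption: stands in for the PostgreSQL engine).
conn = sqlite3.connect(":memory:")
toy_clean.to_sql(name="toy_recipes", con=conn, index=False, if_exists="replace")

# Read the table back and check that shape and column names survived.
restored = pd.read_sql('SELECT * FROM "toy_recipes"', conn)
conn.close()

print(restored.shape == toy_clean.shape)
print(list(restored.columns) == list(toy_clean.columns))
```

Against the real database, the same comparison would use the SQLAlchemy engine and would need to run before `engine.dispose()`.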