# Table of Contents
* [Exploring fenaroli parsing](#Exploring-fenaroli-parsing)
	* [Loading plaintext generated by slate (on AWS instance)](#Loading-plaintext-generated-by-slate-%28on-AWS-instance%29)
	* [RE for flavor compounds](#RE-for-flavor-compounds)
	* [RE for ingredients](#RE-for-ingredients)


In [25]:
from __future__ import division
import numpy as np
import pandas as pd
import scipy.stats as st
import itertools
import math
from collections import Counter, defaultdict
%load_ext autoreload
%autoreload 2

In [27]:
import cPickle as pickle
import slate 
import re

In [26]:
import matplotlib.pylab as plt
import matplotlib as mpl
plt.rcParams['figure.figsize'] = (12.0, 6.0)
%matplotlib inline
%load_ext base16_mplrc
%base16_mplrc light default

The base16_mplrc extension is already loaded. To reload it, use:
  %reload_ext base16_mplrc
Setting plotting theme to default-light. Palette available in b16_colors


# Exploring fenaroli parsing

Original regex for the flavor compounds: `([A-Z]{3,50}\s?[A-Z]{3,50}\s?[A-Z]{3,50})`

## Loading plaintext generated by slate (on AWS instance)

In [4]:
# Pickle files should be opened in binary mode
with open('doc_joined.pkl', 'rb') as f:
    joined = pickle.load(f)
with open('doc_split.pkl', 'rb') as f:
    split = pickle.load(f)

In [5]:
cleaned_joined = joined.replace('\n\n', 'zxzxzxzx').replace('\n',' ').replace('.', ' ')

The entries I'm interested in start at element 17 of `split`.
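As a rough illustration of the cleaning step above, the sketch below runs the same three replacements on a made-up two-entry string (the sample text and the `SENTINEL` name are assumptions for the example, not data from the book). The order matters: paragraph breaks are marked before single newlines are flattened, so entry boundaries survive.

```python
# Toy stand-in for the slate output; the text is invented for illustration.
sample = ("ACETAL\nNatural occurrence: Found in rum.\n\n"
          "ACETOIN\nNatural occurrence: Found in butter\nand wine.\n\n")

SENTINEL = 'zxzxzxzx'  # same marker string as in the cell above

cleaned = (sample
           .replace('\n\n', SENTINEL)  # mark paragraph breaks first
           .replace('\n', ' ')         # then flatten the remaining line breaks
           .replace('.', ' '))         # periods become spaces

# Splitting on the marker recovers one string per entry (plus a trailing empty one).
entries = cleaned.split(SENTINEL)
```

Because every period becomes a space, abbreviations such as "e.g." end up as "e g " in the cleaned text.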

## RE for flavor compounds

In [53]:
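# Note: inside the character class, `,-|` is a range from ',' to '|', so the separator
# also matches '.', digits and letters (which is why 'FEMA.PADI' shows up in the output below).
sq = re.compile(r'(\d?-?[A-Z]{2,50}[,-|\s]?[A-Z]{2,50}[,-|\s]?[A-Z]{2,50})')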
raw_flavor_compounds = re.findall(sq, joined)
# terms_to_remove = ['FEMA PADI','EINECS', 'EMA PADI', 'EINECS NO','FEMA PAD','JECFAN']
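The separator class `[,-|\s]` in the pattern above contains an unescaped hyphen between `,` and `|`, which `re` reads as a character range. The sketch below compares the pattern as written with a variant where the hyphen is escaped. The variant is an assumption about what may have been intended, and the test string is made up.

```python
import re

# Pattern as written: ',-|' is a range covering ',' through '|' (includes '.', digits, letters).
as_written = r'(\d?-?[A-Z]{2,50}[,-|\s]?[A-Z]{2,50}[,-|\s]?[A-Z]{2,50})'
# Hypothetical variant: the hyphen is escaped, so only ',', '-', '|' and whitespace separate words.
escaped = r'(\d?-?[A-Z]{2,50}[,\-|\s]?[A-Z]{2,50}[,\-|\s]?[A-Z]{2,50})'

text = 'FEMA.PADI'  # invented snippet in the style of the source

range_hits = re.findall(as_written, text)
literal_hits = re.findall(escaped, text)

# Looking at the separator class alone on a longer made-up string:
probe = 'FEMA.PADI 2-METHYL,BUTANAL'
range_chars = re.findall(r'[,-|\s]', probe)     # matches every character in probe
literal_chars = re.findall(r'[,\-|\s]', probe)  # matches only the three separators
```

With the range, `'.'` is accepted as a separator, so `FEMA.PADI` is captured whole; with the escaped hyphen it is not matched at all, because neither half has the six letters that three `[A-Z]{2,50}` groups require.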

In [51]:
flavor_series = pd.Series(raw_flavor_compounds)
top_terms = list(flavor_series.value_counts().index.values[:10])

In [54]:
top_terms[:10]

['EINECS',
 'FEMA.PADI',
 '2-METHYL',
 'METHYL',
 '4-METHYL',
 'METHYLTHIO',
 '3-METHYL',
 '5-DIMETHYL',
 '5-METHYL',
 'EXTRACT']

In [8]:
flavor_compounds = [compound.replace('\n','') for compound in raw_flavor_compounds]

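In [9]:
# Collect every term that occurs more than once
cnt = Counter(flavor_compounds)
to_remove = [k for k, v in cnt.iteritems() if v > 1]

In [11]:
# Drop the repeated terms and any match that starts with '-C'
flavor_comp = [comp for comp in flavor_compounds if comp not in to_remove and comp[:2] != '-C']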
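The frequency filter can be seen on a small made-up list. This is only a sketch: the sample terms are invented, and a set is used for the repeated terms so that membership tests stay cheap on long lists (the cells above use a list, which gives the same result).

```python
from collections import Counter

# Invented matches: some labels repeat, the other names occur once.
matches = ['EINECS', 'ACETAL', 'EINECS', 'FEMA PADI',
           'ACETOIN', 'FEMA PADI', '-CHO']

counts = Counter(matches)
# Terms seen more than once are collected for removal.
repeated = {term for term, n in counts.items() if n > 1}

# Keep terms that are not repeated and do not start with '-C'.
kept = [m for m in matches if m not in repeated and not m.startswith('-C')]
```

Here `kept` ends up as `['ACETAL', 'ACETOIN']`. Note that this rule also discards any genuine compound name that happens to be matched twice.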

## RE for ingredients

In [49]:
# Capture everything between the label and the next paragraph marker (non-greedy)
sq = re.compile(r'Natural occurrence: (.+?)zxzxzxzx')
raw_ingredients = re.findall(sq, cleaned_joined)

ingredient_series = pd.Series(raw_ingredients)
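The non-greedy `(.+?)` is what keeps each capture inside a single entry. A minimal sketch on invented text (the two entries below are assumptions, not excerpts from the book) compares it with a greedy `(.+)`:

```python
import re

SENTINEL = 'zxzxzxzx'  # paragraph marker inserted during cleaning
# Made-up cleaned text with two entries.
cleaned = ('ACETAL Natural occurrence: Found in rum ' + SENTINEL +
           'ACETOIN Natural occurrence: Found in butter and wine ' + SENTINEL)

# Non-greedy: stops at the first marker after each label.
lazy = re.findall(r'Natural occurrence: (.+?)' + SENTINEL, cleaned)
# Greedy: runs on to the last marker and swallows the second entry.
greedy = re.findall(r'Natural occurrence: (.+)' + SENTINEL, cleaned)
```

`lazy` holds one string per entry, while `greedy` returns a single string spanning both entries.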

In [50]:
# Ten most frequent matches; the stop strings are hand-picked from them (indices 0-3, 5, 6 and 8)
top_terms = list(ingredient_series.value_counts().index.values[:10])
stop_strings = top_terms[:4] + top_terms[5:7] + [top_terms[8]]

In [45]:
cleaned_ingredients = [comp.strip() for comp in raw_ingredients if comp not in stop_strings]
cleaned_ingredients[:10]

['Present in some liquors (e g , sake, whiskey and cognac); also detected and quantitatively assessed in rums   Found in apple juice, orange juice, orange peel oil, bitter orange juice, strawberry fruit, raw radish, Chinese quince fruit, Chinese  quince flesh, udo (Aralia cordata Thunb )',
 'Reported found in oak and tobacco leaves; in the fruital aromas of pear, apple, raspberry, strawberry and pine- apple; in the distillation waters of Monarda punctata, orris, cumin, chenopodium; in the essential oils of Litsea cubeba, Magnolia  grandiflora, Artemisia brevifolia, rosemary, balm, clary sage, Mentha arvensis, daffodil, bitter orange, camphor, angelica, fennel,  mustard, Scotch blended whiskey, Japanese whiskey, rose wine, blackberry brandy and rum',
 'Reportedly present in grape brandy, apple brandy, rum, sherry and cider',
 'Reported found in guava fruit (Psidium guajava L ) and plum',
 'Reported found in vinegar, bergamot, cornmint oil, bitter orange oil, lemon petitgrain, various da