# Recipe Text Analysis

## Introduction

This analysis is both an exercise in, and a brief introduction to, common text mining and preprocessing methods. The end goal is to cover different scenarios where text analysis can be beneficial and fun! The code is written in a style that emphasizes readability (sticking to Python's ethos) and also provides a base that can be expanded upon.

Since my experience with object-oriented programming is minimal, I preferred a more functional style of writing code. There are possibly more efficient ways of producing the same results; however, the goal here is not optimization or efficiency.

This notebook doesn't assume previous knowledge of any of the topics discussed, and the code is heavily documented and commented to explain what everything does. This notebook is __interactive__: to see results, run each ```code block``` by clicking the run button at the top or pressing Shift + Enter, or run all of them at once using the run all button.

In [1]:
# Importing the necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis.gensim
from text_preprocessor import text_prepro # custom preprocessing function
from model_maker import tfidf_model_maker, lda_model_maker # custom model making function

pd.options.display.max_colwidth = 150



***
## The Data

The recipe data was collected by scraping 3 websites. Surprisingly, I didn't find many readily available datasets for recipes. I chose recipes for the following reasons:
1. The method text is short to medium in length, which means it won't fry my machine when I work with it and will also produce quicker results.
2. Because of the coherent theme, recipes offer more interesting opportunities for analyzing the text.
3. As is, the data is relatively clean: no weird characters, no missing data, no misplaced delimiters/separators. As I will demonstrate, though, it still requires some preprocessing.
4. It's _fun!_

By choosing scraping over getting clean data, I can sink my teeth into the preprocessing, in addition to getting a light introduction to web scraping. There are __2794__ recipes in total, and for each one I collected the title, ingredients and method. They are stored in a comma-separated values (csv) file, without any preprocessing done.

In [None]:
# Loading the data into a pandas DataFrame
recipe_file = 'data_sets/full_recipes.csv'
recipe_df = pd.read_csv(recipe_file, index_col=0)

recipe_df.head()

***
## The Preprocessing

### What _is_ preprocessing?
Any step that changes the raw data prior to analysis or model building is considered preprocessing. It can be as simple as removing empty fields or missing values, and it can grow in complexity quickly as the expected outcomes become more demanding. In text analysis it is arguably the most important step, since it can greatly affect the outcome. The steps differ from one project to another depending on the expected outcomes; however, some of the methods here, such as changing all characters to lower case, are general enough to apply to almost any text project.

### Goals of preprocessing:
The main goals of preprocessing the text in this project are to:
1. Break down the text documents (a single document here is a recipe method) into units, commonly referred to as tokens
2. Reduce the number of _unique_ tokens
3. Remove _unnecessary_ tokens

The order in which these goals are achieved matters more as the documents grow larger and optimized processes become a necessity. The first of these goals is typically the last step in the process: a text (for the recipes, the long text of the method) is split into single tokens, each of which can be a one-, two- or three-word combination. The second goal, reducing the number of unique tokens, is achieved by:

1. Lower casing all the texts
2. Replacing different word forms with a single one through:
    * Lemmatization: Where words are replaced with their lemma or root form. For example, lemmatizing `developed developers developing developments` will result in `develop developer develop development`
    * Replacement dictionaries: Where specific words are replaced with a single unique form. For example, this text ``` cinnamon, cinnamon powder, ground cinnamon``` becomes ```cinnamon, cinnamon, cinnamon```

The reasoning here is to condense as much of the text as possible to allow for better representation of the semantics of the text instead of having multiple word forms, all conveying the same meaning. In the above example, all 3 different forms of cinnamon can be combined to one form without losing the semantics of the term.
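
The lower-casing and replacement-dictionary steps can be sketched with nothing but the standard library. This is only an illustration and not the code inside `text_prepro`: the `REPLACEMENTS` dictionary and the `normalize` function are hypothetical names, and lemmatization is left out because it needs an external NLP library.

```python
import re

# Hypothetical replacement dictionary: each variant maps to one canonical form.
REPLACEMENTS = {
    "cinnamon powder": "cinnamon",
    "ground cinnamon": "cinnamon",
}

# Try longer variants first so a long phrase is never shadowed by a shorter one.
_variants = sorted(REPLACEMENTS, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(v) for v in _variants) + r")\b")

def normalize(text):
    """Lower-case the text, then swap every known variant for its canonical form."""
    lowered = text.lower()
    return _pattern.sub(lambda match: REPLACEMENTS[match.group(1)], lowered)

print(normalize("Cinnamon, cinnamon powder, Ground Cinnamon"))
# prints: cinnamon, cinnamon, cinnamon
```

Lower-casing first keeps the dictionary small, since it only has to list lower-case variants.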

The third goal, removing unnecessary tokens, is achieved by:
1. Removing highly common non-specific words, also known as stop-words or exclusion words
2. Removing any unwanted characters (specific to this analysis) such as repeated spaces, numerical characters and single alphabetical characters, which in this case add no meaning or understanding
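
The removal steps, followed by the final split into tokens, could look roughly like the sketch below. The stop-word list is a made-up, deliberately tiny one, and `clean_tokens` and `ngrams` are hypothetical helpers rather than the functions used by `text_prepro`.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists hold a few hundred words.
STOP_WORDS = {"the", "a", "an", "and", "for", "in", "of", "to", "until"}

def clean_tokens(text):
    """Return lower-cased words without digits, punctuation, single letters or stop-words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # replace digits and punctuation with spaces
    words = text.split()                             # split() also collapses repeated spaces
    return [w for w in words if len(w) > 1 and w not in STOP_WORDS]

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one multi-word token."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = clean_tokens("Bake the  cake for 25 minutes in a hot oven")
print(tokens)             # single-word tokens
print(ngrams(tokens, 2))  # two-word tokens
```

Note that building two-word tokens after stop-word removal pairs words that were not adjacent in the original sentence; whether that is desirable depends on the analysis.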


### The outcome:
A _clean_ document that doesn't contain unwanted words and is split into tokens.

### Quick note about the replacement dictionary:
This is an entirely optional step, but because I want to get some deeper analysis on herbs and spices I've added a special dictionary for them.

In [None]:
# Cleaning the text of the recipes using the preprocessing function
recipe_df['method_cln'] = recipe_df['method'].apply(text_prepro)

recipe_df.head()

***
# Text EDA

Exploratory Data Analysis, or EDA for short, is standard practice in data science and analysis projects. As the name implies, it is the discovery and exploration of the data. For text data this can mean looking at some indicative statistics that give an idea about the data we're about to work with. Here I will only show a table and a simple graph.

## TF, DF and Others
The most basic step to evaluate the preprocessing of text is to check the cumulative term-frequencies and document-frequencies. If for example you find that the term ```them``` is at the top of the lists of TF and DF then you would probably want to add it to your stopwords; although this is unlikely.

The table below the code block contains the cumulative term frequency across all documents (_frequency of the keyword in all of the documents_), the document frequency (_number of documents the keyword appeared in_) and a TF-IDF score, a weighting system that gives a higher score to more important keywords (there is __a lot__ more to it than that, but I won't get into it now).

In [None]:
keyword_df = tfidf_model_maker(recipe_df['method_cln'])
keyword_df[:30]
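
To make the three columns concrete, here is a minimal sketch that computes them on a made-up toy corpus. It uses one common TF-IDF variant (cumulative TF multiplied by `log(N / DF)`); the internals of `tfidf_model_maker` are not shown in this notebook and may use a different weighting.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (made up for illustration).
docs = [
    ["bake", "cake", "oven"],
    ["bake", "bread", "oven", "oven"],
    ["boil", "pasta"],
]

# Cumulative term frequency: occurrences of each token across all documents.
term_frequency = Counter(token for doc in docs for token in doc)
# Document frequency: number of documents that contain each token at least once.
document_frequency = Counter(token for doc in docs for token in set(doc))

def tfidf(term):
    """Cumulative TF weighted by the inverse document frequency log(N / DF)."""
    idf = math.log(len(docs) / document_frequency[term])
    return term_frequency[term] * idf

# Print keyword, TF, DF and score, highest score first.
for term in sorted(term_frequency, key=tfidf, reverse=True):
    print(term, term_frequency[term], document_frequency[term], round(tfidf(term), 3))
```

A term that appears in every document gets an inverse document frequency of `log(1) = 0`, which is how this weighting pushes down words that carry little information.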


In [None]:
# Top 30 keywords with their term and document frequencies
terms = keyword_df['keyword'][:30]
tf = keyword_df['term_frequency'][:30]
df = keyword_df['document_frequency'][:30]

sns.set_style('white')

graph = plt.figure(figsize=(15,5))
subplot = graph.add_subplot(111)
subplot.bar(terms, tf, label='Tf')
subplot.bar(terms, df, label='Df')  # drawn over the Tf bars; Df never exceeds Tf
subplot.set(
    title = 'Tf/Df for top 30 words by TfIdf',
    xlabel = 'term',
    ylabel = 'counts'
           )
subplot.legend()
subplot.tick_params(axis='x', labelrotation=25)  # rotate the term labels
plt.show()

***
# The LDA Topic Model

Latent Dirichlet Allocation, or LDA, is a method for extracting topics from a set of documents. Shown below are the extracted topics. The number of topics in the model can be changed using the ```n_topics``` parameter of the ```lda_model_maker``` function.

In [None]:
lda_model = lda_model_maker(recipe_df['method_cln'], n_topics=20)

## The Topics

The code cell below creates a DataFrame that makes it slightly easier to read the words contained in the topics. Change the number inside the ```show_topics``` function to match the ```n_topics``` used above.

In [None]:
lda_df = pd.DataFrame(
    data=[(topic_num, words) for topic_num, words in lda_model.show_topics(20)],
    columns=['topic_no','words']
)
lda_df.set_index('topic_no',inplace=True)
lda_df
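
If the model is a gensim LDA model (an assumption, since `lda_model_maker` is a custom function), each entry in the `words` column is a single string such as `0.031*"salt" + 0.027*"pepper"`. The sketch below, with a hypothetical `split_topic` helper and a made-up example string, splits such a string into word and weight pairs that are easier to sort or filter.

```python
import re

def split_topic(topic_string):
    """Turn '0.031*"salt" + 0.027*"pepper"' into [('salt', 0.031), ('pepper', 0.027)]."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Made-up topic string in the assumed weight*"word" format.
example = '0.031*"salt" + 0.027*"pepper" + 0.019*"olive oil"'
print(split_topic(example))
```

Under the same assumption, it could be applied to the whole table with `lda_df['words'].apply(split_topic)`.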

## Visualizing the Topics

Here the ```pyLDAvis``` package is used to visualize the topics in the LDA model.

In [None]:
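# NOTE: `lda_vis` is never defined above; it is assumed to hold the output of pyLDAvis.gensim.prepare() for `lda_model`
pyLDAvis.display(lda_vis)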