# Recipe Text Analysis

## Introduction

This analysis is an exercise in common text mining and preprocessing methods. The end goal is to cover different scenarios where text analysis can be beneficial and fun! The code is written in a style that emphasizes readability (sticking to Python's ethos) and provides a base that can be expanded upon.

Since my experience with object-oriented programming is minimal, I preferred a more functional style of writing code. There are possibly more efficient ways of producing the same results, mainly with regard to memory use, an optimization I chose to forgo in this iteration.

***
## The Data

The recipe data was scraped from sources online. Surprisingly enough, I didn't find many readily available datasets for recipes. I chose recipes for the following reasons:
1. The text of the method of preparation is short to medium in length, which means it won't fry my machine when training models and will also produce quicker results.
2. Because of the coherent theme, they offer more interesting opportunities for analyzing the text.
3. As is, the data is relatively clean: no weird Unicode characters, no missing data and no misplaced delimiters. As I will demonstrate, though, it still requires some preprocessing.
4. It's _fun!_

By choosing scraping over getting clean data, I can sink my teeth into the preprocessing, in addition to getting a light introduction to web scraping. There are __2794__ recipes in total, and for each one I collected the title, ingredients and method. They are stored in a CSV file without any preprocessing applied.

***
## The Preprocessing

Preprocessing is arguably the most important step in any text analysis project. The steps can differ from one project to another depending on the expected outcomes; however, some methods, such as normalizing the casing, are almost universally applicable. The main goals of preprocessing the text are to:
1. Break down the text documents into units, commonly referred to as tokens.
2. Reduce the number of _unique_ tokens.
3. Remove _unnecessary_ tokens.
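
As a rough sketch of the first goal, tokens can be single words or short word combinations (n-grams). The helper name `ngrams`, the sample sentence and the choice to split on whitespace are all assumptions made for illustration; the project itself may tokenize differently.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in a whitespace-separated string."""
    words = text.split()
    # Slide a window of width n across the word list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


method = "preheat the oven"  # made-up sample text
unigrams = ngrams(method, 1)  # ['preheat', 'the', 'oven']
bigrams = ngrams(method, 2)   # ['preheat the', 'the oven']
trigrams = ngrams(method, 3)  # ['preheat the oven']
```

A text shorter than `n` words simply yields an empty list, so the same helper can be applied to every recipe without special cases.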

The order in which these goals are achieved becomes more important as the documents grow larger and optimized processing is no longer a luxury. The first of these goals is typically the last step in the process: a string of text (for the recipes, the long text string of the method) is split into individual tokens, which can be one-, two- or three-word combinations. The second goal, reducing the number of unique tokens, is achieved by:

1. Lowercasing all the text
2. Replacing different word forms with a single one through:
    * Lemmatization: Where words are replaced with their lemma or root form. For example, lemmatizing `developed developers developing developments` will result in `develop developer develop development`
    * Replacement dictionaries: Where specific words are replaced with a single unique form. For example, the text `cinnamon, cinnamon powder, ground cinnamon` becomes `cinnamon, cinnamon, cinnamon`

The reasoning here is to condense the text as much as possible so that it better represents the semantics, instead of keeping multiple word forms that all convey the same meaning. In the above example, all three forms of cinnamon can be combined into one without losing the significance of the term.
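
The lowercasing and replacement-dictionary steps can be sketched with the standard library alone. The names `REPLACEMENTS` and `normalize` are hypothetical, and the dictionary holds only the cinnamon example from above. Lemmatization is left out of this sketch because it needs a language resource from an NLP library.

```python
import re

# Hypothetical replacement dictionary: variant form -> single canonical form.
REPLACEMENTS = {
    "cinnamon powder": "cinnamon",
    "ground cinnamon": "cinnamon",
}


def normalize(text, replacements=REPLACEMENTS):
    """Lowercase the text, then collapse known variants into one form."""
    text = text.lower()
    # Handle longer variants first so a short variant cannot clip a long one.
    for variant in sorted(replacements, key=len, reverse=True):
        # \b keeps the match on whole words only.
        pattern = r"\b" + re.escape(variant) + r"\b"
        text = re.sub(pattern, replacements[variant], text)
    return text


normalized = normalize("Cinnamon, cinnamon powder, Ground Cinnamon")
# normalized == 'cinnamon, cinnamon, cinnamon'
```

Lowercasing runs first so the dictionary only needs lowercase keys.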

The third goal, removing unnecessary tokens, is achieved through:
1. Removing highly common non-specific words, also known as stop-words.
2. Removing any unwanted characters (specific to this analysis), such as repeated spaces, numerical characters and single alphabetical characters.
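
A minimal sketch of both removal steps follows, using only regular expressions and a set lookup. The `STOP_WORDS` set is a tiny made-up sample rather than a real stop-word list, and `clean` is a hypothetical helper name.

```python
import re

# Tiny illustrative stop-word set; a real analysis would use a fuller list.
STOP_WORDS = {"the", "a", "an", "and", "in", "for", "to", "of"}


def clean(text, stop_words=STOP_WORDS):
    """Drop digits, single letters, stop-words and repeated spaces."""
    text = re.sub(r"\d+", " ", text)           # numerical characters
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)  # single alphabetical characters
    words = [w for w in text.split() if w.lower() not in stop_words]
    # split() followed by join() also collapses any repeated spaces.
    return " ".join(words)


cleaned = clean("bake   the cake for 45 minutes in a 9 x 13 pan")
# cleaned == 'bake cake minutes pan'
```

Digits are replaced with a space rather than deleted outright, so the words on either side of a number are not fused together.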