#How Good Is This Wine?  - Using A.I For Quality Control
Hello everyone, and a warm welcome to this interactive tutorial, where we'll be exploring how we can use some pretty simple A.I. to automate our Quality Control process, saving us time and money.

As always, our mission here is to show people in all types of business that these A.I. tools are available to them and that they can use them without any technical knowledge. You don't need to know the code behind Microsoft Word in order to use it, right? I hope to convince you that the same applies here.

##Introducing the data

So, for today, say hello to our dataset: a list of nearly 5,000 white wines with information about their chemical properties, things like sulphates, chlorides, pH levels, etc. (you can have a look at the complete spreadsheet [here](https://docs.google.com/spreadsheets/d/1YU2sqcuG4_DAzYvD-uLbp5NHg21S51CxILCq9xPaOsE/edit#gid=1916199166)).

Right, so why on earth would that be of any use to us?

Imagine the following scenario: let's say we are a medium-sized wine distributor. We receive wines from producers, store them and then supply them to liquor stores around the country or even sell them online. Every day, we carry out random quality control tests on samples of the wines we receive from the producers. The spreadsheet we are going to be working with today is compiled from all the past tests that we have carried out. Let's have a look.

In [0]:
#We import the library of pandas ( remember pandas is like Excel, but after having taken some sort of illegal steroids)
import pandas as pd
#I previously uploaded our data to this link
data_url=("https://github.com/busyML/Wine-Quality-Control/blob/master/winewhite.xlsx?raw=true")

#We load our data from that link to Pandas 
data = pd.read_excel(data_url)

#We print out the first 20 rows of our data to visualize what we are working with here
data.head(20)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9
5,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1
6,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6
7,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8
8,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5
9,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0


This dataset, I have to admit, doesn't look particularly interesting. This is probably because it is just a set of features **without** any conclusions attached. In many datasets, we would have a set of columns like this plus an additional column giving some sort of classification of the wine (for example, a column that says "good wine/bad wine"). The name of the game in those cases is to create a model that is able to classify new wines as good or bad, based on the past labels. These sorts of predictions are possible when we have previously collected **answers**.

However, no such luck with this dataset. It has no label column that we could even try to predict. More often than not, datasets in the real world are like this. So what can we use this datasheet for? Well, as we go through it step by step, we will still be able to do some very interesting things, such as:


*   **Anomaly Detection**: Our model will be able to detect if something seems to be wrong with a new wine (for example, if the wine is corked or has gone off).

*   **Categorize and order our data for us**: Our model will find categories within our dataset and then tell us which category a new wine belongs to. This can be extremely useful for pricing, for example.

So, first things first, we need to clean our data. Let's get mopping.



####Step 0 -Importing libraries

First of all, as always we need to pre-load the libraries to make our lives a hundred times easier. This is always the first thing for us to do. Consider it as our **Step 0**.

Without these libraries, we wouldn't be able to use any of the easy commands. These libraries use open source code created by other people, allowing us to execute complex operations with commands that are only a few words long. I always import all common libraries whether I end up using them or not just so that I don't need to worry about any of this later on. 

*To execute the code and follow along, simply press the "Play" button at the top left-hand side of the code.*


In [0]:
import numpy as np # This library allows us to easily carry out simple and complex mathematical operations.
import matplotlib.pyplot as plt #Allows us to plot data, create graphs and visualize data. Perfect for your Powerpoint slides ;)
import sklearn #The one and only. This amazing library holds all the secrets. Containing powerful algorithms packed in a single line of code, this is where the magic will happen.
import sklearn.model_selection # more of sklearn. It is a big library, but trust me it is worth it.
import sklearn.preprocessing 
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, explained_variance_score,mean_absolute_error,mean_squared_error,precision_score,recall_score, accuracy_score,f1_score
from sklearn.utils import shuffle
import pandas as pd
from pandas.plotting import radviz


import random # Allows us to call random numbers, occasionally very useful.
from google.colab import files #Allows us to upload and download files directly from the browser.
import pprint #Allows us to neatly display text
from collections import OrderedDict



#Unsupervised Learning
from sklearn.neighbors import NearestNeighbors,LocalOutlierFactor
from sklearn.cluster import KMeans


## I - Data cleaning

So it looks like we've been blessed with a relatively clean dataset. First of all, we can see that we are dealing purely with numerical values, which is fantastic news for our algorithms. If we had any text or categories in our columns, we would have to convert these to some sort of number encoding.
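If you would rather check this than take my word for it, here is a minimal sketch. To keep it self-contained it uses a tiny stand-in table called `sample` (my own name) holding three rows and three columns copied from the table above; on the real spreadsheet you would run the same two checks on `data`.

```python
import pandas as pd

# A tiny stand-in table: three rows and three columns copied from the table above
sample = pd.DataFrame({
    "fixed acidity": [7.0, 6.3, 8.1],
    "pH": [3.0, 3.3, 3.26],
    "alcohol": [8.8, 9.5, 10.1],
})

# How many columns contain something other than numbers?
non_numeric_columns = sample.select_dtypes(exclude="number").shape[1]

# How many cells in the whole table are empty?
missing_values = int(sample.isna().sum().sum())

print("Non-numerical columns:", non_numeric_columns)
print("Missing values:", missing_values)
```

If either number were above zero, we would have some encoding or gap-filling to do before handing the data to an algorithm.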

There is one thing that we need to take care of however...



###Step 1- Everyday I´m shuffling...

So this is extremely simple. We want to shuffle our data. Why is this? Because if the data is not shuffled, then some unwanted bias might slip in. The most common case, for instance, would be if the data was entered chronologically. Now imagine that we were buying cheaper and less palatable wines 5 years ago, simply because back then our company was smaller and we had fewer resources. The model might pick up on this and could think that the year in which we purchased the wine is an indicator of how good the wine is. If we then gave it a new wine to sample, it might overestimate its quality simply because we purchased it in 2019.

It is always best practice to shuffle our data, and there are no excuses for not doing so, especially when it takes only one line of code:


In [0]:
#We use the "sample" command of pandas to shuffle our data. The random state means that we will always shuffle the data in the same way, so that when different people run this code, they will all get the same results.
data= data.sample(frac=1, random_state=85)

#We print out the first 20 rows of our data to check that it has indeed been shuffled. On the left we have the index number, which we can also think of as an ID number.
data.head(20)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
1880,7.7,0.3,0.42,14.3,0.045,45.0,213.0,0.9991,3.18,0.63,9.2
2724,7.5,0.18,0.31,6.5,0.029,53.0,160.0,0.99276,3.03,0.38,10.9
462,5.9,0.25,0.19,12.4,0.047,50.0,162.0,0.9973,3.35,0.38,9.5
3929,6.6,0.32,0.47,15.6,0.063,27.0,173.0,0.99872,3.18,0.56,9.0
1240,7.9,0.14,0.28,1.8,0.041,44.0,178.0,0.9954,3.45,0.43,9.2
1223,8.0,0.28,0.42,7.1,0.045,41.0,169.0,0.9959,3.17,0.43,10.6
1058,7.5,0.21,0.34,1.2,0.06,26.0,111.0,0.9931,3.51,0.47,10.7
313,5.7,0.36,0.21,6.7,0.038,51.0,166.0,0.9941,3.29,0.63,10.0
2027,6.9,0.32,0.15,8.1,0.046,51.0,180.0,0.9958,3.13,0.45,8.9
3242,7.0,0.29,0.35,1.4,0.036,42.0,109.0,0.99119,3.31,0.62,11.6


# II- Data Learning

###Anomaly Detection

So the first thing we can do with our data is to create an anomaly detection model. Put simply, we will use an algorithm to scan through all the wines of our current dataset so that it learns what a "normal" white wine is like. This way, when we give the algorithm the test data for a new wine we receive in the warehouse, it will be able to detect whether this wine is normal or if there is something "abnormal", or in everyday language, "fishy" about it.

The algorithm is called **"Local Outlier Factor"**, and here is a visualization of how it works:


![alt text](https://github.com/busyML/Wine-Quality-Control/raw/master/anomaly_comparison_0011.png)


The yellow dots are "normal" samples whereas the blue dots on the edge have been identified as "abnormal"
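To make the picture a bit more concrete, here is a minimal sketch on made-up 2-D data (not our wines). The names `normal_points`, `toy_lof` and `new_points` are mine, chosen just for this illustration. We teach the algorithm what a "normal" crowd of points looks like, then show it one point from the middle of the crowd and one from far away.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Made-up 2-D data: 200 "normal" points scattered around (0, 0)
rng = np.random.RandomState(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# novelty=True lets the model judge points it has never seen before
toy_lof = LocalOutlierFactor(novelty=True)
toy_lof.fit(normal_points)

# One point in the middle of the crowd, one very far away from it
new_points = np.array([[0.0, 0.0], [8.0, 8.0]])

toy_predictions = toy_lof.predict(new_points)        # 1 = normal, -1 = abnormal
toy_scores = toy_lof.decision_function(new_points)   # below zero = abnormal

for point, prediction, score in zip(new_points, toy_predictions, toy_scores):
    print(point, "prediction:", prediction, "score:", round(float(score), 2))
```

The score gives a little more nuance than the plain 1 / -1 answer: the further below zero it is, the more unusual the point looks compared with its neighbours.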

Let's first load and train our algorithm in a couple of lines of code:

In [0]:
#From The SKlearn library, we can load this handy algorithm called "Local Outlier Factor", we'll call it lof for short from now on.
lof = LocalOutlierFactor(novelty=True)

#Using the ".fit" command, we are ordering our algorithm to learn from our data what a normal white wine should be like.
lof.fit(data)

print("Learning Done!")



Learning Done!




There you go! In half a second, our model learnt what a normal white wine looks like. Now let's put it to the test by giving it three new wines it has not seen before. We'll name them **Wine_1**, **Wine_2** and **Wine_3**.
So imagine the scene: it's *5:30 AM* and you are receiving the first order of the day. You take out three bottles of white wine and carry out a chemical test for quality control purposes. Once you have their data, you can very easily use the algorithm to see if everything is ok with this batch. You run the test.

In [0]:
#Chemical Data for Wine 1
wine_1= [[6.8,0.32,0.16,7,0.045,30,145,0.9949,3.18,0.47,9.6]]

#Chemical Data for Wine 2
wine_2 = [[7.6,1.58,0.0,2.1,0.136,5.0,9.0,0.99476,3.5,0.4,10.9]]

#Chemical Data for Wine 3
wine_3=[[5.2,0.37,0.2,7.6,0.046,35,110,0.9954,3.29,0.58,9.6]]


#We can use the ".predict" command to ask the algorithm to detect an anomaly, if it outputs "1", the wine is normal, if not it will output a "-1"


#We can then create a simple "if/else" condition that will give us the outcome in plain English.
print('for wine 1 :')
if lof.predict(wine_1)==1:
  print("This wine is normal, it passes quality control.")
else:
  print("Abnormal wine detected! Human checking is needed on this one!")
  
print('for wine 2 :')
if lof.predict(wine_2)==1:
  print("This wine is normal, it passes quality control.")
else:
  print("Abnormal wine detected! Human checking needed !")
  
print('for wine 3 :')
if lof.predict(wine_3)==1:
  print("This wine is normal, it passes quality control.")
else:
  print("Abnormal wine detected! Human checking is needed on this one!")
  
  
  
  


for wine 1 :
This wine is normal, it passes quality control.
for wine 2 :
Abnormal wine detected! Human checking needed !
for wine 3 :
This wine is normal, it passes quality control.


Ok, so two out of our three wines passed. So, what is going on with wine #2? As the machine tells you to, you decide to pick up the bottle and give it a closer look... and you realize you accidentally grabbed a bottle of red wine. A clear brain fart on your part, but you console yourself that it is 5:30 AM after all and you are happy that nobody was around to see you. You quietly bless the algorithm that saved you from an embarrassing moment where you would have had to apologize to a customer, and go off to make yourself a well-earned coffee.


---

So, in summary, this algorithm was able to flag up anything strange in our test samples. In this case, the data for wine #2 is genuinely taken from red wine (as any oenologist will be quick to attest) so we would expect that the algorithm detects it as an abnormal white wine!

A final point: it is important to understand that the algorithm is not saying that the wines are "good" or "bad", but simply "common" or "very uncommon". Say we had an exceptionally high quality wine: our algorithm might flag it up as abnormal, as in "abnormally delicious". Whenever the algorithm finds an anomaly, it is down to the human to double check that anomaly and investigate what is so strange about it.
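One practical caveat worth a quick experiment: this algorithm works by measuring distances, so a column with big numbers (like total sulfur dioxide, in the hundreds) can drown out a column with tiny numbers (like density, which hovers around 1). The sketch below is my own illustration on made-up data, not something the tutorial's model does. It puts a `StandardScaler` in front of the algorithm so that every column gets an equal say, and compares the verdict on one deliberately odd sample.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Made-up data with two columns on very different scales
big_column = rng.normal(loc=140.0, scale=40.0, size=300)     # values in the hundreds
small_column = rng.normal(loc=0.994, scale=0.003, size=300)  # values very close to 1
toy_wines = np.column_stack([big_column, small_column])

# A sample with a typical big value but a wildly unusual small value
odd_wine = np.array([[140.0, 1.10]])

# Same algorithm twice: once on the raw numbers, once with scaling in front
raw_lof = LocalOutlierFactor(novelty=True).fit(toy_wines)
scaled_lof = make_pipeline(StandardScaler(), LocalOutlierFactor(novelty=True)).fit(toy_wines)

raw_verdict = raw_lof.predict(odd_wine)[0]
scaled_verdict = scaled_lof.predict(odd_wine)[0]

print("Without scaling:", raw_verdict)
print("With scaling:", scaled_verdict)
```

Whether scaling helps on the real wine data is something you would have to try for yourself, but it is a cheap thing to test when an obviously strange sample slips through.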

So that was anomaly testing, but what else can be done with our data?


###Categorization

As we mentioned at the beginning, our data is not categorized in any way. However, we can use an algorithm to delve into the data for us and find commonalities between different wines in order to form categories (this is also known as clustering). This is incredibly useful since it means we can transform a big bulk of unsorted data into a neat set of categories. We can then use those categories for lots of things, such as market research (finding different groups of customers of our business), technical support (what are the different sorts of queries received), or re-organizing our product range (like in our case with wine). I could go on and on with the use cases.

So the algorithm we will use for this is called **K-means**, and here is a picture to help you to visualize what it actually does: 

![texto alternativo](https://github.com/busyML/Wine-Quality-Control/raw/master/clustering_image.jpg)

As you can see above, the algorithm takes the data from the left and on the right finds three distinct groups, represented by the different colors. Each point is a different sample; however, when two points are the same color, we can think of them as being similar and thus belonging to the same category. This is how we can categorize our data!

The great thing here is that we can choose how many categories we want. We could ask our algorithm to separate our data into 20 categories or just 2. This decision will be based on our business context. As our dataset is not large and the wines are not all that different from one another anyway (sorry if I've offended anyone), I'm going to ask our algorithm to divide our wine data into **three categories**.
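If the business context doesn't give you an obvious number, one common rule of thumb is to try several values and watch how tightly packed the groups become (K-means calls this "inertia"; lower means tighter). Here is a minimal sketch on made-up data that contains three obvious groups. The data and the names `points` and `inertias` are my own assumptions for the illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data with three obvious groups
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Try 1 to 6 categories and record how tightly packed the groups are each time
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias[k] = model.inertia_
    print(k, "categories -> inertia:", round(model.inertia_, 1))
```

The inertia always shrinks as we add categories, so the thing to look for is the point where the improvement suddenly becomes small. On this toy data that happens after three categories.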

So as always, we'll use just a few lines of code to load and use this algorithm on our data:

In [0]:

#We create the kmeans algorithm from sklearn and ask it for three categories
kmeans= KMeans(n_clusters=3)

#We use the ".fit" command to run the kmeans algorithm on our data
kmeans.fit(data)

#We create a new column in our spreadsheet that records, for each wine, the category it was given during the fit
#(reading "labels_" avoids training the algorithm a second time)
data['category']= kmeans.labels_

#prints out the different categories we have and the number of wines that were assigned to it
data['category'].value_counts()

2    1977
0    1796
1    1125
Name: category, dtype: int64

Ok, done, so our data has been separated into three distinct groups! For now, the names of these categories are pretty boring:

* **Category 0:** has 1796 wines.
* **Category 1:** has 1125 wines.
* **Category 2:** has 1977 wines.
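One thing worth knowing: the numbers 0, 1 and 2 are arbitrary, and K-means starts from random positions, so running the cell again may hand out the numbers in a different order. Here is a minimal sketch, on made-up data, of how passing `random_state` keeps the numbering the same from one run to the next. The value 7 is an arbitrary choice of mine, and `first_run` / `second_run` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data with three groups
points, _ = make_blobs(n_samples=300, centers=3, random_state=1)

# Two separate runs that share the same random_state
first_run = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(points)
second_run = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(points)

same_numbering = np.array_equal(first_run, second_run)
print("Same numbering on both runs:", same_numbering)
```

This matters as soon as we start attaching meaning to a particular number, because a label like "category 2" only stays put if the run is repeatable.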

We’ve also added this categorization to our original datasheet in an extra column, as you can see below:

In [0]:
data.head(21)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,category
1880,7.7,0.3,0.42,14.3,0.045,45.0,213.0,0.9991,3.18,0.63,9.2,1
2724,7.5,0.18,0.31,6.5,0.029,53.0,160.0,0.99276,3.03,0.38,10.9,2
462,5.9,0.25,0.19,12.4,0.047,50.0,162.0,0.9973,3.35,0.38,9.5,2
3929,6.6,0.32,0.47,15.6,0.063,27.0,173.0,0.99872,3.18,0.56,9.0,2
1240,7.9,0.14,0.28,1.8,0.041,44.0,178.0,0.9954,3.45,0.43,9.2,1
1223,8.0,0.28,0.42,7.1,0.045,41.0,169.0,0.9959,3.17,0.43,10.6,2
1058,7.5,0.21,0.34,1.2,0.06,26.0,111.0,0.9931,3.51,0.47,10.7,0
313,5.7,0.36,0.21,6.7,0.038,51.0,166.0,0.9941,3.29,0.63,10.0,2
2027,6.9,0.32,0.15,8.1,0.046,51.0,180.0,0.9958,3.13,0.45,8.9,1
3242,7.0,0.29,0.35,1.4,0.036,42.0,109.0,0.99119,3.31,0.62,11.6,0


####Naming the categories
Right, so here comes the **crucial** part, the human added value. The **Kmeans** algorithm grouped these data points for us into three groups but it is unable to tell us why! All that we know is that there is some sort of similarity between the wines within each group, but it is down to us to establish what this similarity actually is. What we can do for now is take the first 100 wines of our spreadsheet, display their ID numbers category by category, and find out for ourselves what the wines in each category seem to have in common:

In [0]:
#We initialize three empty lists that will later contain the wines of each category
category_0=[]
category_1=[]
category_2=[]

#This loop sorts the first 100 wines of our spreadsheet based on what category they belong to.
for i in range (100):
  if (data.iloc[i]['category'])==0:
    category_0.append(data.index[i])
  if (data.iloc[i]['category'])==1:
    category_1.append(data.index[i])
  if (data.iloc[i]['category'])==2:
    category_2.append(data.index[i])

#Let's print out the ID numbers belonging to each category.
print(len(category_0),"wines in category 0:",category_0)

print(len(category_1),"wines in category 1:",category_1)

print(len(category_2),"wines in category 2:",category_2)
                      

37 wines in category 0: [1058, 3242, 3736, 3225, 4133, 1391, 1668, 2605, 996, 4294, 1957, 2888, 4777, 4653, 1166, 3015, 3558, 330, 2631, 2566, 1349, 2061, 4669, 1015, 3071, 3986, 3323, 29, 3646, 3584, 3829, 3390, 3088, 3090, 566, 4842, 793]
20 wines in category 1: [1880, 1240, 2027, 3010, 1918, 2633, 3326, 500, 1734, 1117, 1789, 282, 1856, 3487, 620, 2789, 4501, 2824, 1947, 3752]
43 wines in category 2: [2724, 462, 3929, 1223, 313, 4687, 4524, 342, 2287, 3892, 407, 4601, 4698, 1364, 158, 3103, 1075, 2265, 1787, 997, 3886, 2820, 4780, 1529, 2657, 2749, 3074, 3984, 4846, 1673, 3729, 4594, 1420, 339, 2230, 509, 2206, 3126, 3469, 4347, 3834, 2203, 1242]


Great, so what we have above are the ID numbers of the first 100 wines of our list, grouped by the category they are in.


So now here is the fun part. Now that we have this list of wines that supposedly share something in common, let's physically go to our warehouse, pull out the wines listed for each category and see what they all have in common. It could be all sorts of things: maybe the wines from each category are from a different type of grape (*Sauvignon*, *Pinot*, etc.), or maybe they come from a different region (*South America*, *North America*, *Europe*), or maybe they are from large producers or small producers. Whatever it is, there should be some sort of commonality that jumps out at you straight away and that explains why the algorithm separated the data in this way.
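Before heading down to the warehouse, the spreadsheet itself can offer a first hint: we can ask pandas for the average of each chemical property per category. The sketch below is a minimal illustration using only ten wines and two columns copied from the table above (the name `toy` is mine); on the full data the same `groupby` line would be run on `data`.

```python
import pandas as pd

# Ten wines copied from the table above (only two of the columns, to keep it short)
toy = pd.DataFrame({
    "residual sugar": [14.3, 6.5, 12.4, 15.6, 1.8, 7.1, 1.2, 6.7, 8.1, 1.4],
    "alcohol": [9.2, 10.9, 9.5, 9.0, 9.2, 10.6, 10.7, 10.0, 8.9, 11.6],
    "category": [1, 2, 2, 2, 1, 2, 0, 2, 1, 0],
})

# Average value of each column within each category
category_profile = toy.groupby("category").mean().round(2)
print(category_profile)
```

Ten wines are far too few to draw any conclusions from, so treat this purely as a demonstration of the command.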

Now let's imagine that in our case, after a rather fun tasting session that dragged on late into the night (and the effect of which may explain why you previously tested red wine by mistake), it becomes clear to us that the algorithm separated our data in terms of quality. This is rather impressive, and fantastic news, because quality is related to price. It means that our algorithm is capable, in our fictional example, of discriminating by taste quality, and we can use this to set the pricing for our wine. So we can now automate our pricing process!

So after the hours of tasting, we are able to give the following labels to our wine categories:


* **Category 0:** Average, mid-range wine, usually priced between $30-50

* **Category 1:** Low-range wine, usually priced under $30

* **Category 2:** High-end, great quality wines, usually priced over $50



So now that we've figured that out, let's just update our spreadsheet to change the category number we have for each wine from "0, 1, 2" to something a bit more meaningful like **"Low price", "Medium price", "High price"**.


We can quickly do so  with the following code:

In [0]:
#Here we use three short lambda expressions to convert the category numbers to plain English labels that we'll be able to understand.

data['category'] = data['category'].apply(lambda x:"High Price" if x==2 else x)
data['category'] = data['category'].apply(lambda x:"Medium Price" if x==0 else x)
data['category'] = data['category'].apply(lambda x:"Low Price" if x==1 else x)

#We print the top part of our dataset to observe the changes
data.head(21)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,category
1880,7.7,0.3,0.42,14.3,0.045,45.0,213.0,0.9991,3.18,0.63,9.2,Low Price
2724,7.5,0.18,0.31,6.5,0.029,53.0,160.0,0.99276,3.03,0.38,10.9,High Price
462,5.9,0.25,0.19,12.4,0.047,50.0,162.0,0.9973,3.35,0.38,9.5,High Price
3929,6.6,0.32,0.47,15.6,0.063,27.0,173.0,0.99872,3.18,0.56,9.0,High Price
1240,7.9,0.14,0.28,1.8,0.041,44.0,178.0,0.9954,3.45,0.43,9.2,Low Price
1223,8.0,0.28,0.42,7.1,0.045,41.0,169.0,0.9959,3.17,0.43,10.6,High Price
1058,7.5,0.21,0.34,1.2,0.06,26.0,111.0,0.9931,3.51,0.47,10.7,Medium Price
313,5.7,0.36,0.21,6.7,0.038,51.0,166.0,0.9941,3.29,0.63,10.0,High Price
2027,6.9,0.32,0.15,8.1,0.046,51.0,180.0,0.9958,3.13,0.45,8.9,Low Price
3242,7.0,0.29,0.35,1.4,0.036,42.0,109.0,0.99119,3.31,0.62,11.6,Medium Price


Looking good. Now that we have categorized our wines and given them labels, this is a pretty handy spreadsheet for our employees to have, so let's save a copy for safekeeping.


We can download it to our hard drive with the following code:


In [0]:
#We create an excel file that contains the wine with their new categories
data.to_excel("wines with price categories.xlsx")

#We use the ".download" command to download the new excel file to our browser
files.download("wines with price categories.xlsx")

### Classifying new wines.

So what's even better is that now, when we receive new wines into our warehouse, we can ask our algorithm to classify them based on our previous data. So let's take **Wine 1** and **Wine 3** (not **Wine 2**, because it failed the anomaly test, remember), and let's see how our **Kmeans** algorithm categorizes them:

In [0]:
#We can use the ".predict" command for this

#A simple condition to interpret the output in plain english
if kmeans.predict(wine_1)==2:
        print("Wine 1 should be high priced (more than $50)")
if kmeans.predict(wine_1)==0:
        print("Wine 1 should be medium priced ($30-50)")
if kmeans.predict(wine_1)==1:
        print("Wine 1 should be low priced (less than $30)")


if kmeans.predict(wine_3)==2:
        print("Wine 3 should be high priced (more than $50)")
if kmeans.predict(wine_3)==0:
        print("Wine 3 should be medium priced ($30-$50)")
if kmeans.predict(wine_3)==1:
        print("Wine 3 should be low priced (less than $30)")
  
  



There we have it: for the two wines we inputted, the algorithm has judged **Wine 1** to be an average quality wine, and therefore it should be put in the medium price bracket.

Meanwhile, **Wine 3** is apparently of higher quality and so should be priced accordingly.

What we have here is now an automated price setting system, something that can tangibly speed up the operations at our wine warehouse.


####Visualization
So we've dealt with a lot of numbers so far, now perhaps it's time for a bit of pretty visualization. Let's try to actually visualize how our algorithm has separated our different wines and see what insights we can draw from it.

For this, we'll use a graph called a "**Radviz**" plot, which allows us to visualize all our wines (each dot being one of our wines) and see how they relate to our different features and to other wines:


![texto alternativo](https://github.com/busyML/Wine-Quality-Control/raw/master/radviz.png)


So, what are we looking at here? Each dot represents a wine from our dataset and they have been color-coded as per the price bracket that our algorithm assigned to them. 

A few things jump out at us just from glancing over this plot:


* **The high priced wines seem to have higher levels of alcohol and sulphates**
*  **Low priced wines have more residual sugar and chlorides**
*  **Medium priced wines seem to have an average level of pH ( which makes sense)**


So those are a few inferences we can make, and we could use them for some business intelligence. For example, if we wanted to avoid receiving low-quality wines from our suppliers, we could tell them that we will not be accepting wines that have over a certain level of *residual sugar*.

Finally, as I'm sure you agree, this plot is not perfect. There seems to be a lot of overlap between the categories, for instance. The reason for this is that, despite this being a 2D plot, you should actually think of it as 3D. Imagine that it is in fact a hill and that we are looking down at it from above. When our data is multi-dimensional, it is always hard to visualize. For example, our spreadsheet has 11 columns, which means it needs to be represented in ***11-D***. This is impossible for our feeble human brains but thankfully, it is something that computers find very easy to do. As such, visualizations like the one above are probably as close as we'll ever be able to get to visualizing complex datasets like the one we have today.
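For readers curious about how a plot like the Radviz one above gets drawn, here is a minimal sketch using the `radviz` command from pandas. It uses only ten wines and four columns copied from the labelled table earlier on, stored in a small table I have called `toy`, so the picture it produces will be much sparser than the one shown.

```python
import pandas as pd
from pandas.plotting import radviz

# Ten wines copied from the labelled table earlier on (four of the columns only)
toy = pd.DataFrame({
    "residual sugar": [14.3, 6.5, 12.4, 15.6, 1.8, 7.1, 1.2, 6.7, 8.1, 1.4],
    "pH": [3.18, 3.03, 3.35, 3.18, 3.45, 3.17, 3.51, 3.29, 3.13, 3.31],
    "sulphates": [0.63, 0.38, 0.38, 0.56, 0.43, 0.43, 0.47, 0.63, 0.45, 0.62],
    "alcohol": [9.2, 10.9, 9.5, 9.0, 9.2, 10.6, 10.7, 10.0, 8.9, 11.6],
    "category": ["Low Price", "High Price", "High Price", "High Price", "Low Price",
                 "High Price", "Medium Price", "High Price", "Low Price", "Medium Price"],
})

# Each feature becomes an anchor on a circle, and each wine is a dot pulled
# towards the features where it has relatively high values
ax = radviz(toy, "category")
ax.set_title("Radviz sketch on ten wines")
```

In a notebook, the plot appears right under the cell, with one color per price category.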

We have trained our models to **1) detect anomalies** and **2) Sort our wines into price categories** and these were the most important methodologies and concepts I wanted to share with you today.

Our final task will be to implement these models in a short program that any employee could use. They will be able to obtain useful predictions from these models to speed up their job.

# III - Data Predicting

So let's now build a simple and easy to use program that allows our employees at the wine warehouse to input the chemical data of a new wine and obtain some useful information about it.

Unless you are a web developer or programmer, feel free to skip the technical explanation. However, I do recommend that you try to use it for yourself and see how the predictions can be carried out in real time. Please press the "*play*" button on the top left-hand side and follow the instructions. This will allow you to start forming an idea of how we can integrate these tools into our everyday office lives.

In [0]:
#We create a program called "wine_categorizer"

def wine_categorizer():
  
  #The first prompt asks the user whether they have data correctly formatted. If not, they will have to enter it manually.
  
  prompt1=input("Do you have the wine data in the following format:[fixed acidity,volatile acidity,citric acid....]? (yes/no)")
  #if that is the case....
  if prompt1=="yes" or prompt1=="Yes" or prompt1=="y" or prompt1=="YES":
    #...we ask the user to simply copy and paste the line of data
    print("ok great! just copy and paste the data below")
    
    inputted_data=(input(":"))
    #We convert the user's input from a string to a list of numbers that our algorithms can work with (we also drop any square brackets that were copied along)
    formatted_data=[list(map(float,inputted_data.strip().strip("[]").split(',')))]
  #if not we get the user to input the data manually, one variable at a time
  else:
    
    print("Ok, no problem, let´s do it manually:")
    
    entered_fixed_acidity=float(input("the wine´s fixed acidity:"))
    entered_volatile_acidity=float(input("volatile acidity:"))
    entered_citric_acid=float(input("citric acid:"))
    entered_residual_sugar=float(input("residual sugar:"))
    entered_chlorides=float(input("chloride levels:"))
    entered_free_sulfur_dioxide=float(input("free sulfur dioxide level:"))
    entered_total_sulfur_dioxide=float(input("total sulfur dioxide :"))
    entered_density=float(input("density:"))
    entered_pH=float(input("pH level :"))
    entered_sulphates=float(input("sulphates :"))
    entered_alcohol=float(input("alcohol% :"))
    #formatting the data so it can be computed by our algorithms
    formatted_data=[[entered_fixed_acidity,entered_volatile_acidity,entered_citric_acid,entered_residual_sugar,entered_chlorides,entered_free_sulfur_dioxide,entered_total_sulfur_dioxide,entered_density,entered_pH,entered_sulphates,entered_alcohol]]
  #perform anomaly detection on the entered data and save the result in the variable "anomaly_check"
  anomaly_check=lof.predict(formatted_data)
  
  #if the anomaly check returns a 1, our data is not an anomaly
  if anomaly_check==1:
    print("This wine is normal, it passes quality control.")
    
     #if the anomaly check returns a -1, our data is an anomaly

  else:
    print("Abnormal wine detected! Human checking needed !")
  #If the wine is an anomaly, then we stop here (no need to proceed to the price categorization).
  if anomaly_check==-1:
    #Asking the user whether they want to check a new wine. If the answer is "Yes", the program restarts
    prompt2=input("Would you like to restart program to check another wine?(Yes/No)")
    if prompt2=="yes" or prompt2=="Yes" or prompt2=="y" or prompt2=="YES":
      wine_categorizer()
    else:
      #"return" ends the program while keeping our trained models in memory ("quit()" can shut down the whole notebook session)
      return
  #If the wine is deemed normal by the model, the program moves on to the price category algorithm
  if anomaly_check==1:
      print("We will now proceed to the price categorization")
      
      #we use ".predict" to find out what the price category of the inputted wine would be
      price_category_check=kmeans.predict(formatted_data)
      if price_category_check==2:
        print("This wine should be high priced (more than $50)")
      if price_category_check==0:
        print("This wine should be medium priced ($30-50)")
      if price_category_check==1:
        print("This wine should be low priced (less than $30)")
      prompt2=input("Would you like to restart program to check another wine?(Yes/No)")
      if prompt2=="yes" or prompt2=="Yes" or prompt2=="y" or prompt2=="YES":
        wine_categorizer()
      else:
        return
  
wine_categorizer()

Do you have the wine data in the following format:[fixed acidity,volatile acidity,citric acid....]? (yes/no)yes
ok great! just copy and paste the data below
:9.5,0.40,0.2,7.6,0.046,35,110,0.9954,3.29,0.58,9.6
Abnormal wine detected! Human checking needed !
Would you like to restart program to check another wine?(Yes/No)y
Do you have the wine data in the following format:[fixed acidity,volatile acidity,citric acid....]? (yes/no)y
ok great! just copy and paste the data below
:8.2,0.37,0.2,7.6,0.046,35,110,0.9954,3.29,0.58,9.6
This wine is normal, it passes quality control.
We will now proceed to the price categorization
This wine should be medium priced ($30-50)
Would you like to restart program to check another wine?(Yes/No)no
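One small weak spot in a program like this is the copy-and-paste step: a missing value or a stray bracket would crash it or, worse, feed the models the wrong number of columns. Here is a minimal sketch of a helper that tidies and checks the pasted line first. The function name `parse_wine_row` and its error message are my own inventions for this illustration, not part of the program above.

```python
def parse_wine_row(text, expected_values=11):
    """Turn a pasted line such as "8.2,0.37,..." into the [[...]] shape that ".predict" expects."""
    # Drop surrounding spaces and any square brackets that were copied along
    cleaned = text.strip().strip("[]")
    values = [float(part) for part in cleaned.split(",")]
    # Our models were trained on 11 columns, so any other count is a typing mistake
    if len(values) != expected_values:
        raise ValueError("Expected " + str(expected_values) + " values but got " + str(len(values)))
    return [values]

# The second wine from the session above, pasted with its brackets this time
row = parse_wine_row("[8.2,0.37,0.2,7.6,0.046,35,110,0.9954,3.29,0.58,9.6]")
print(row)
```

The returned value has the same list-inside-a-list shape as `formatted_data`, so it could be passed straight to `lof.predict` or `kmeans.predict`.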


#Conclusion

I hope you were able to follow this tutorial. The point that I wish to make here is that even when we have a dataset that is just a jumbled mess of observations, we can still find plenty of ways to extract value from it and use it to accelerate our work. I used quality control and categorization here because these take up a lot of time, and yet they are relatively easy to automate. Go and have a look at what old spreadsheets you may have on your hard drive and have a think about how you could implement the methodology and techniques we discussed above for yourself.

As always, I hope I've managed to convince you that Machine Learning is a tool available to everyone and that we can all use it to make our lives more convenient.

And for any questions or help, feel free to drop me an email at [conrad.w.s@gmail.com](mailto:conrad.w.s@gmail.com).

I'm off to drink some more high priced wine. 