# Tutorial 1: Data cleaning & visualization 

---

## Introduction

Welcome! This tutorial shows you how to visualize infrared spectroscopy samples of apples using Python. In this tutorial you will learn:

 - how to read data into Python from an Excel file
 - how to use dataframes (pandas package)
 - how to visualize infrared data
 - how to perform data standardization

For this tutorial, we have three kinds of apples, namely Golden Delicious (`GD`), Granny Smith (`GS`), and Royal Gala (`RG`). The overall goal is to use infrared spectrum data to classify samples as bruised (`B`) or sound (`S`).

All tutorials use the `GS` data, while participants solve the exercises on the other two data sets.

---

First we import some libraries:

In [5]:
# ___Cell no. 1___

import pandas as pd # for importing data into data frame format
import seaborn as sns # For drawing useful graphs, such as bar graphs
from matplotlib.pyplot import show # This displays graphs once they have been created

The above statements define the aliases `pd` and `sns`, which are used as prefixes to identify pandas and seaborn functions, respectively, in the following code.

---

<b><i> Reading in data </i></b> 

The code below does two things:
- reads data from an Excel file
- converts the Excel file format into a pandas dataframe

In [15]:
# ___Cell no. 2___
import os

df = pd.read_excel(os.path.abspath('../data/Detect-GS.xlsx'))  # change the path as needed

[**hint**](https://www.geeksforgeeks.org/python-os-path-abspath-method-with-example/): the Excel file lives in a sibling directory (`../apple_classification/data`), so we refer to it with the relative path `../data/Detect-GS.xlsx`. `os.path.abspath` converts this relative path into an absolute path by resolving it against the current working directory.
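To see what `os.path.abspath` does with a relative path, the short sketch below resolves the path from Cell no. 2 against the current working directory. It only manipulates path strings, so it runs even if the Excel file is not present; the variable names `relative_path` and `absolute_path` are illustrative.

```python
import os

# Relative path used in Cell no. 2 (".." means "one directory up").
relative_path = "../data/Detect-GS.xlsx"

# abspath joins the path with the current working directory and
# normalizes the result, so the ".." component is resolved.
absolute_path = os.path.abspath(relative_path)

print("working directory:", os.getcwd())
print("absolute path:    ", absolute_path)

# abspath does not check whether the file exists; test that separately.
print("file exists:      ", os.path.exists(absolute_path))
```

If the last line prints `False`, the notebook is probably being run from a different directory than expected, and the relative path needs adjusting.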

---

<b><i> Examining data </i></b> 

First, let's take a look at the raw infrared data:

In [19]:
# ___Cell no. 3___
df.head(5) # shows the first 5 rows of the data frame

| Unnamed: 0 | Sample | Condition | Age | Source | 11995.49 | 11991.63 | 11987.78 | 11983.92 | 11980.06 | 11976.21 | ... | 4034.497 | 4030.64 | 4026.783 | 4022.926 | 4019.069 | 4015.211 | 4011.354 | 4007.497 | 4003.64 | 3999.783 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | GD-ch-bruise1.5h-10a | B | 1h | S1 | -0.083126 | -0.082581 | -0.082173 | -0.081704 | -0.081251 | -0.080829 | ... | 1.208914 | 1.216652 | 1.219303 | 1.207366 | 1.191071 | 1.185219 | 1.183722 | 1.175261 | 1.168796 | 1.191991 |
| 1 | GD-ch-bruise1.5h-10b | B | 1h | S1 | -0.154684 | -0.154762 | -0.154668 | -0.154153 | -0.153504 | -0.153067 | ... | 0.744595 | 0.745167 | 0.743545 | 0.744555 | 0.750424 | 0.752385 | 0.752032 | 0.755532 | 0.755115 | 0.747916 |
| 2 | GD-ch-bruise1.5h-10c | S | 1h | S1 | -0.066006 | -0.065688 | -0.0652 | -0.064603 | -0.064006 | -0.063497 | ... | 1.443587 | 1.456797 | 1.474139 | 1.478318 | 1.455842 | 1.425429 | 1.414297 | 1.446042 | 1.510794 | 1.53462 |
| 3 | GD-ch-bruise1.5h-10d | S | 1h | S1 | -0.110366 | -0.110041 | -0.109542 | -0.109117 | -0.108661 | -0.108094 | ... | 1.257423 | 1.262108 | 1.269531 | 1.262279 | 1.24315 | 1.235391 | 1.237499 | 1.246332 | 1.26553 | 1.268394 |
| 4 | GD-ch-bruise1.5h-11a | B | 1h | S1 | -0.142115 | -0.141852 | -0.141603 | -0.141129 | -0.140701 | -0.140477 | ... | 0.697953 | 0.696903 | 0.69935 | 0.704406 | 0.707838 | 0.709304 | 0.710684 | 0.711052 | 0.707295 | 0.703002 |


In the above dataframe, each row corresponds to a different `GS` apple sample, while the columns give the values of 2078 variables, which can be explained as follows:
- Sample:
- Condition: Bruised or Sound apple
- Age:
- Source:
- 11995.49,...,3999.783: infrared data
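
The column layout described above can be split programmatically into metadata and spectral columns. The sketch below uses a tiny hand-made dataframe with made-up values (not the real `GS` data) and assumes that the first five columns are metadata and all remaining columns are spectral; the names `toy_df`, `meta_cols` and `spectral_cols` are illustrative.

```python
import pandas as pd

# Toy stand-in for the real data: same leading columns, made-up values,
# and only three spectral columns instead of the 2073 in the real file.
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Sample": ["sample-a", "sample-b"],
    "Condition": ["B", "S"],
    "Age": ["1h", "1h"],
    "Source": ["S1", "S1"],
    11995.49: [-0.08, -0.15],
    11991.63: [-0.08, -0.15],
    3999.783: [1.19, 0.75],
})

# Assumption: the first five columns hold metadata, the rest hold spectra.
meta_cols = list(toy_df.columns[:5])
spectral_cols = list(toy_df.columns[5:])

metadata = toy_df[meta_cols]
spectra = toy_df[spectral_cols]

print(metadata.shape)  # (2, 5)
print(spectra.shape)   # (2, 3)

# Count how many samples fall into each condition (B = bruised, S = sound).
print(toy_df["Condition"].value_counts())
```

Selecting by position keeps the sketch independent of whether the spectral column labels were read as numbers or as strings, which may differ for the real Excel file.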


In [17]:
# ___Cell no. 4___


df_shape = df.shape  # "df.shape" produces a tuple of 2 numbers
print(f"the shape of the data is {df_shape}")

# The individual numbers in the tuple are accessed as follows:
print(f"where {df_shape[0]} is the number of rows, and")
print(f"{df_shape[1]} is the number of columns")

the shape of the data is (547, 2078)
where 547 is the number of rows, and
2078 is the number of columns


This shows that we are working with high-dimensional data: there are far more variables (2078) than samples (547). One of the major tasks is therefore to reduce the dimensionality of the data. This can be done manually with feature engineering methods or automatically with deep learning. Given the small amount of data, we will focus on feature engineering methods, which are explored further in Tutorial 2.
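
As a small aside, the shape tuple can be unpacked directly into two names, which makes it easy to compare the number of samples with the number of variables. The sketch below uses a made-up dataframe rather than the real data; the names `toy_df`, `n_rows` and `n_cols` are illustrative.

```python
import pandas as pd

# Made-up data: 3 samples (rows) described by 5 variables (columns).
toy_df = pd.DataFrame([[0.1, 0.2, 0.3, 0.4, 0.5]] * 3)

# Tuple unpacking: the first entry is the row count, the second the column count.
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows, {n_cols} columns")

# More columns than rows is a quick indicator of high-dimensional data.
print("more variables than samples:", n_cols > n_rows)
```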

---

**Exercise 1:** Please display the first 5 rows and the shape of the other two data sets (`GD`, `RG`).
<br>


In [18]:
#  ___ code here ____


---

<b><i> Cleaning data </i></b> 