**Section 1: Getting to know your data**

Notebook for "Introduction to Data Science and Machine Learning"

Version 1.1, 15 April 2025

# Getting to know your data

## 1. Data Understanding (CRISP-DM)

In this section we look at methods for getting to know the data: inspecting it, getting an overview of its statistics, and plotting it. We also check whether there are missing values.

The purpose of **Data Understanding** is to get an overview of the data and to document any possible problems. These problems are solved in a later step (data preparation). Please always make sure never to modify the original data without keeping a copy.

## 2. Introduction

In this notebook you will get to know a basic data structure for data analysis and machine learning as well as some useful functions and methods.

A data frame is the basic data structure in the Python module `pandas`. A data frame is a two-dimensional structure that can store data of different data types such as characters, integers, floating point values, and categorical data. A single column is called a "data series" (the `pandas` class `Series`).

The size of data frames can be modified, columns and rows can be added and deleted. Columns have labels that can be used to access data.

Data frames can be created inside code. In this assignment we will use existing data.
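
As a minimal sketch of what creating and resizing a data frame in code can look like: the data frame `flowers` and its values below are made up for illustration and are not part of the Iris data set.

```python
import pandas as pd

# A small, made-up data frame built from a dictionary of columns
flowers = pd.DataFrame({
    "name": ["a", "b", "c"],
    "petal_length": [1.4, 4.7, 6.0],
})

# Add a column by assigning a list under a new column label
flowers["petal_width"] = [0.2, 1.4, 2.5]

# Delete the row with index label 1; drop() returns a new data frame
# and leaves `flowers` itself unchanged
smaller = flowers.drop(index=1)

print(flowers.shape)   # (3, 3)
print(smaller.shape)   # (2, 3)
```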

Often data is stored in **csv files**. In csv files the column values are separated by commas (thus the name: comma separated values) and rows by a new line. Spreadsheet applications such as LibreOffice Calc and Microsoft Excel can read csv files and store data as csv files. 

csv files may have a header line that contains the column names. A csv file can be read using the function `read_csv()` in `pandas`, which takes the file name as argument.
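
The following sketch illustrates the csv format and `read_csv()`. To keep it self-contained, the csv content is a made-up string wrapped in `io.StringIO` instead of a file on disk; with a real file you would pass the file name.

```python
import io
import pandas as pd

# Made-up csv content: a header line, then one row per line
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
"""

# read_csv() accepts a file name or a file-like object
measurements = pd.read_csv(io.StringIO(csv_text))

print(measurements)
print(measurements.columns.tolist())
```

The header line becomes the column labels, and each following line becomes one row of the data frame.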

Some basic data sets are also provided with modules. In this notebook we will use the Iris data set, which was also presented in the lecture. This data set is often used for basic classification tasks. We will use it to introduce some basic functionality.

## 3. Importing Required Python Modules

We will need some basic modules:

- `pandas` implements the class DataFrame that we get to know in this notebook
- `seaborn` provides statistical plots and some basic data sets like the iris flower data set, and
- `matplotlib.pyplot` provides basic (MATLAB-like) plotting functionality.

We will thus import all three modules first.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## 4. Loading the data

We will now load the iris data set and store it in a variable `iris`.

In [None]:
iris=sns.load_dataset('iris')

And quickly display the variable.

In [None]:
iris

**Question:**

How many rows and columns does the `iris` data frame have? What are the types of the columns? Can you make a guess?

**Answer:**



## 5. Getting to know the data

`iris` is a `pandas` data frame (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas data frames offer a lot of functionality in the form of methods. Methods are like functions that are bound to specific objects.

Methods are called as `variableName.method(parameter list)`.

You have already seen methods of lists. Let `theList` be an object of the `list` class, then the method `append()` is called as 

`theList.append(12)`.

The following list specifies some basic methods of pandas data frames. Please call each method in the code cells for the variable `iris`, observe the output and make a note in the cell (in the form of a comment). If you have problems with the methods, you can find the pandas documentation at: https://pandas.pydata.org/docs/reference/frame.html

- the method `head()` to **display the top 5 rows**. If you pass an integer value `n` as parameter (`head(n)`), the top `n` rows are displayed. 

In [None]:
# call the method for the iris data frame


- the method `tail()` to **view the bottom rows**. If you pass an integer value `n` as parameter, the bottom `n` rows are displayed. 

In [None]:
# call the method for the iris data frame


In [None]:
# call the method for the iris data frame to display the bottom three rows


- Attributes of objects are accessed using the variable name and a dot followed by the attribute name: `variableName.attributeName`.
<p>Display the following attribute values (using the <code>print()</code> function) and describe their values in a comment:</p>

    - `index`
    - `columns`
    - `shape`
    - `axes`


In [None]:
# display the values of the attributes


- the method `info()` summarizes information about the data frame.

In [None]:
# call the method for the iris data frame


- the method `describe()` gives a statistical overview of the numerical data.

In [None]:
# call the method for the iris data frame


## 6. Accessing data in the data frame

Data columns (series) can easily be accessed using the column name like an attribute:

In [None]:
iris.sepal_length

The column name can also be used inside `[]` to access the series. Don't forget to use quotation marks around the column name.

In [None]:
iris['sepal_length']
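
The two access styles are not fully interchangeable. The sketch below uses a made-up data frame `table` (not part of this notebook's data) to show two cases where only the bracket form reaches the column: a label containing a space, and a label that collides with a method name.

```python
import pandas as pd

# Made-up data frame with two "awkward" column labels
table = pd.DataFrame({"Value 1": [1.0, 2.0], "count": [10, 20]})

# A label with a space cannot be written as an attribute,
# but bracket access works for any column label
print(table["Value 1"])

# "count" is also the name of a DataFrame method, so attribute
# access returns the method, not the column
print(callable(table.count))      # True
print(table["count"].tolist())    # [10, 20]
```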

`loc` can be used to access single elements (rows) or to create slices:

In [None]:
#second row
iris.loc[1]

In [None]:
# sepal_length of the second row
iris.loc[1,'sepal_length']

In [None]:
# the first three rows
iris.loc[0:2]

Please note that the slice indices are **inclusive**.

In [None]:
# rows 10 to 15, petal_length and petal_width
iris.loc[10:15,['petal_length', 'petal_width']]

We can use boolean values in an index vector to access data.
The third quartile, Q3, of sepal_length is 6.4. Let's create a boolean array that has `True` if the length is > 6.4, `False` otherwise:

In [None]:
booleanArray=iris.sepal_length>6.4

In [None]:
print(booleanArray)

We can now use this array to display/access only the rows where sepal_length is > 6.4:

In [None]:
iris.loc[booleanArray]

or store the result in a new data frame:

In [None]:
largeSepalLength=iris.loc[booleanArray]

Let's check by displaying the statistics:

In [None]:
largeSepalLength.describe()
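
Boolean arrays can also be combined. As a hedged sketch on a small made-up data frame `sample` (the values are illustrative, not taken from the Iris data set): `&` means "and", `|` means "or", and `~` negates a boolean array.

```python
import pandas as pd

# Made-up sample with the same column names as the iris data frame
sample = pd.DataFrame({
    "sepal_length": [5.1, 6.5, 7.0, 6.9],
    "species": ["setosa", "versicolor", "versicolor", "virginica"],
})

long_sepal = sample["sepal_length"] > 6.4
is_versicolor = sample["species"] == "versicolor"

both = sample.loc[long_sepal & is_versicolor]     # both conditions hold
either = sample.loc[long_sepal | is_versicolor]   # at least one holds
not_long = sample.loc[~long_sepal]                # condition negated

print(both)
print(either)
print(not_long)
```

When a comparison is written directly inside the combination, it needs its own parentheses, for example `(sample["sepal_length"] > 6.4) & (sample["species"] == "versicolor")`.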

With `iloc`, rows and columns are accessed by their integer position, in the same way as in `numpy` arrays:

In [None]:
iris.iloc[1:3,:]

Please note, the (end-)index in `iloc` is **exclusive**!

`iloc` always uses the integer position of rows and columns, whereas `loc` uses their labels.

By contrast the (end-)index in `loc` is **inclusive**!

In [None]:
iris.loc[1:3,:]
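
In `iris` the row labels happen to be the integers 0, 1, 2, ..., so labels and positions look the same. The following sketch uses a made-up data frame `frame` with letters as row labels to make the difference between `loc` and `iloc` visible.

```python
import pandas as pd

# Made-up data frame whose row labels are letters, not integers
frame = pd.DataFrame({"value": [10, 20, 30, 40]},
                     index=["a", "b", "c", "d"])

# loc slices by label; the end label "c" is included
by_label = frame.loc["a":"c"]

# iloc slices by position; the end position 2 is excluded
by_position = frame.iloc[0:2]

print(by_label.index.tolist())      # ['a', 'b', 'c']
print(by_position.index.tolist())   # ['a', 'b']
```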

`loc` also allows access to columns by their names.

In [None]:
iris.loc[1:3,'sepal_width']


Use lists to select several columns by name, or specific rows by label:

In [None]:
iris.loc[1:3,['sepal_width','petal_width']]


In [None]:
iris.loc[[2,4,6,8],['sepal_width','petal_width']]


## 7. Calculations

It is possible to use arithmetic operations on data frames. As the `iris` data frame is too large to check every value at a glance, let's briefly take a look at the example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.div.html.
In the example you see how a (small) data frame can be defined directly.

In [None]:
df = pd.DataFrame({'angles': [0, 3, 4],
                   'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

In [None]:
df

Let's test some arithmetic operations:


In [None]:
df+1

In [None]:
df-1

In [None]:
df*[2,0.5]

In [None]:
df+[-1,-100]

Please check out the documentation for more detailed information about the operations as well as how to apply the operations along the different axes.
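
As a small sketch of what "along the different axes" means, the block below re-creates `df` so that it runs on its own and uses the method form `mul()` with its `axis` parameter. The factors are arbitrary illustration values.

```python
import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4],
                   'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

# One factor per column: equivalent to df * [2, 0.5]
per_column = df.mul([2, 0.5], axis='columns')

# One factor per row: the series is matched against the row labels
row_factors = pd.Series([1, 2, 3],
                        index=['circle', 'triangle', 'rectangle'])
per_row = df.mul(row_factors, axis='index')

print(per_column)
print(per_row)
```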

## 8. Empty cells

Please load the file `NATest.csv` as follows:

In [None]:
data=pd.read_csv('../NATest.csv')

In [None]:
data

Let's take a look:

In [None]:
data.info()

`NaN` indicates that the respective cell is empty, i.e. has no value. Methods like `sum()`, `mean()`, `median()`, `cumsum()` etc. (please check the documentation for a complete list of the functions) can deal with these values and will work only on existing data:

In [None]:
data['Value 1'].sum()

In [None]:
data['Value 1'].cumsum()

In [None]:
data['Value 1'].mean()

In [None]:
data.describe()
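
The same behaviour can be reproduced without the file. The sketch below uses a made-up series `values` (not the content of `NATest.csv`) and also shows the `skipna` parameter, which switches off the skipping of `NaN` values.

```python
import numpy as np
import pandas as pd

# Made-up series with one missing value
values = pd.Series([1.0, np.nan, 3.0])

print(values.sum())               # 4.0, the NaN is skipped
print(values.mean())              # 2.0, mean of the two existing values
print(values.count())             # 2, number of existing values
print(values.cumsum().tolist())   # [1.0, nan, 4.0]

# With skipna=False a single NaN makes the result NaN
print(values.sum(skipna=False))   # nan
```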

To test a value for `NaN`, the comparison operator `==` does not work, because `NaN` is not equal to anything, not even to itself. You need to apply the method `isna()` or `notna()`:

In [None]:
data['Value 1'].isna()

In [None]:
data['Value 2'].notna()
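
A short sketch of why `==` fails, using a made-up series `values` instead of the file data:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# NaN compares unequal to everything, including itself
print(np.nan == np.nan)              # False

# Therefore == never finds the missing value ...
print((values == np.nan).tolist())   # [False, False, False]

# ... while isna() does
print(values.isna().tolist())        # [False, True, False]
```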

We can thus count the `NaN` values by simply using the `sum()` method:

In [None]:
print('Value 1 has',data['Value 1'].isna().sum(),'NaN values')
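
The same idea works for a whole data frame at once. This is a sketch on a made-up data frame `table`; the column names only mimic those used above and the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up data frame with missing values in both columns
table = pd.DataFrame({"Value 1": [1.0, np.nan, 3.0],
                      "Value 2": [np.nan, np.nan, 6.0]})

# isna() gives a boolean data frame; sum() counts True per column
missing_per_column = table.isna().sum()

# any(axis=1) marks rows with at least one missing value
rows_with_missing = table.isna().any(axis=1).sum()

print(missing_per_column)
print(rows_with_missing)   # 2
```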

## 9. Plots

The examples in this section are based on https://pandas.pydata.org/docs/user_guide/visualization.html#visualization-scatter. Check out more examples there.

Let's go back to the iris data set:

In [None]:
iris=sns.load_dataset('iris')
iris.columns

A small scatter plot of sepal length versus petal length:

In [None]:
iris.plot.scatter(x='sepal_length',y='petal_length')

Instead of a 3-D plot the value of a third dimension, here the petal width, can be added as color coding.

In [None]:
iris.info()

In [None]:
iris.plot.scatter(x='sepal_length',y='petal_length',c='petal_width')

We can also use a third value to control the size of the points:

In [None]:
iris.plot.scatter(x="sepal_length", y="sepal_width", s=iris["petal_width"] * 10);

In order to color the dots according to the classes we need a categorical column for the class. In order not to modify the data, we copy our data frame and modify the copy:

In [None]:
myIris=iris.copy()
myIris["class"] = myIris["species"].astype("category")
myIris.plot.scatter(x='sepal_length',y='petal_length',c='class',cmap="viridis")
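
To see what the conversion to a categorical column does, here is a sketch on a short made-up series `species` (four values only, not the full Iris column). A categorical series stores the distinct values once and represents each entry by an integer code.

```python
import pandas as pd

# Made-up series of class names
species = pd.Series(["setosa", "virginica", "setosa", "versicolor"])

as_category = species.astype("category")

# The distinct values, sorted alphabetically for strings
print(as_category.cat.categories.tolist())  # ['setosa', 'versicolor', 'virginica']

# One integer code per entry, referring to the list above
print(as_category.cat.codes.tolist())       # [0, 2, 0, 1]
```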

We can easily create boxplots for the different attributes.

In [None]:
iris.plot.box()

Or a scatter matrix:

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
scatter_matrix(iris, alpha=0.2, figsize=(6, 6), diagonal="kde");

Seaborn offers additional plotting possibilities. A `hue` parameter allows for coloring the dots according to the class. As we need a categorical attribute, we use our copied data frame `myIris` with the newly created column:

In [None]:
myIris=iris.copy()
myIris["class"] = myIris["species"].astype("category")
sns.scatterplot(x=myIris.sepal_length,y=myIris.sepal_width,hue=myIris['class'])

## 10. Exercise: The penguin data

Load the `penguins` data set. Read about it on the internet and use some of the functions and methods introduced above to understand and plot the data.

In [None]:
penguins=sns.load_dataset('penguins')

# your code.... enjoy!

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This notebook was created by Christina B. Class for teaching at EAH Jena and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.