
# Pandas Data Structures Overview
In this section, we will discuss the `Series`, `Index`, and `DataFrame` classes. To do so, we will read in a snippet of the CSV file we will work with later. Don't worry about that part yet, though.

## About the Data
In this notebook, we will be working with 5 rows from the earthquake data collected from September 18, 2018 through October 13, 2018 (obtained from the US Geological Survey (USGS) using the [USGS API](https://earthquake.usgs.gov/fdsnws/event/1/)).

## Working with NumPy Arrays
Let's read in a short CSV file (using `numpy`) for some sample data.

In [1]:
import os
import sys
os.listdir()


['.config', 'sample_data']

In [6]:
import pandas as pd

In [7]:
team_data = {
    'name': ['Haram', 'Minji', 'Donghyun', 'Wonjune'],
    'Hobby': ['Game', 'running', 'soccer', 'baseball'],
    'nationality': ['korea', 'korea', 'korea', 'korea']
}

In [8]:
team_df = pd.DataFrame(team_data)

In [9]:
team_df.to_csv('team_members.csv', index=False)

In [10]:
loaded_team_df = pd.read_csv('team_members.csv')

In [11]:
print(loaded_team_df.head())

       name     Hobby nationality
0     Haram      Game       korea
1     Minji   running       korea
2  Donghyun    soccer       korea
3   Wonjune  baseball       korea


In [12]:
print(loaded_team_df.describe())

         name Hobby nationality
count       4     4           4
unique      4     4           1
top     Haram  Game       korea
freq        1     1           4


In [13]:
names = loaded_team_df['name']
hobbies = loaded_team_df['Hobby']

In [14]:
loaded_team_df.set_index('name', inplace=True)

In [15]:
import numpy as np

data = np.genfromtxt(
    'data/example_data.csv', delimiter=';',
    names=True, dtype=None, encoding='UTF'
)
data

FileNotFoundError: ignored

In [None]:
from google.colab import drive
drive.mount('/content/drive')

We can find the dimensions with the `shape` attribute:

In [None]:
data.shape

We can find the data types with the `dtype` attribute:

In [None]:
data.dtype

Each of the entries in the array is a row from the CSV file. NumPy arrays contain a single data type (unlike lists, which allow mixed types); this allows for fast, vectorized operations. When we read in the data, we got an array of `numpy.void` objects, which are created to store flexible types. This is because NumPy has to store several different data types per row: four strings, a float, and an integer. This means we can't take advantage of the performance improvements NumPy provides for single data type objects.
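
To make the `numpy.void` point concrete, here is a minimal sketch using a hand-built structured array. The two rows and their values are made up for illustration (they are not the earthquake data), and only four of the fields are included:

```python
import numpy as np

# Made-up sample rows; field names mirror some of the columns in this notebook
rows = np.array(
    [
        ('2018-10-13 11:10', 'Alaska', 1.7, 0),
        ('2018-10-13 10:45', 'Indonesia', 5.2, 1),
    ],
    dtype=[('time', 'U16'), ('place', 'U20'), ('mag', 'f8'), ('tsunami', 'i8')],
)

row_type = type(rows[0])        # each row is a numpy.void record
field_names = rows.dtype.names  # the column names live on the dtype
first_mag = rows[0]['mag']      # a field of a single row can be read by name
```

Because every row mixes strings, a float, and an integer, NumPy cannot treat the array as one block of a single numeric type.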

Say we want to find the maximum magnitude. We can use a **[list comprehension](https://www.python.org/dev/peps/pep-0202/)** to select the element at index 3 of each row (each row is represented as a `numpy.void` object). This makes a list, meaning that we can take the maximum using the `max()` function:

In [None]:
%%timeit
max([row[3] for row in data])

If we, instead, create a NumPy array for each column, this operation is much easier (and more efficient) to perform. We can use a **[dictionary comprehension](https://www.python.org/dev/peps/pep-0274/)** to make a dictionary where the keys are the column names and the values are NumPy arrays of the data:

In [None]:
array_dict = {
    col: np.array([row[i] for row in data])
    for i, col in enumerate(data.dtype.names)
}
array_dict

Grabbing the maximum magnitude is now simply a matter of selecting the `mag` key and calling the `max()` method. This is nearly twice as fast as the list comprehension implementation when dealing with just 5 entries; imagine how much worse the first attempt will perform on large data sets:

In [None]:
%%timeit
array_dict['mag'].max()

However, this representation has other issues. Say we wanted to grab all the information for the earthquake with the maximum magnitude. How would we go about that? We would need to find the index of the maximum and then, for each of the keys in the dictionary, grab the value at that index:

In [None]:
np.array([
    value[array_dict['mag'].argmax()]
    for key, value in array_dict.items()
])

The result is now a NumPy array of strings (our numeric values were converted), and we are back in the row-oriented format from earlier. Also, consider trying to sort the data by magnitude from smallest to largest. In the first representation, we would have to sort the rows by examining index 3 of each one. With the second representation, we would have to determine the order for the indices from the `mag` column, and then sort all the other arrays with those same indices. Clearly, working with several NumPy arrays of different data types at once is a bit cumbersome. However, `pandas` builds on top of NumPy arrays to make this easier. Let's start our exploration of `pandas` with an overview of the data structures.
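
For reference, the sorting procedure described for the dictionary-of-arrays representation could be sketched as follows. The `sample` dictionary is hypothetical and stands in for `array_dict`:

```python
import numpy as np

# Hypothetical stand-in for array_dict with made-up values
sample = {
    'place': np.array(['A', 'B', 'C']),
    'mag': np.array([4.7, 1.2, 3.0]),
}

# Step 1: find the positions that would sort the mag column ascending
order = sample['mag'].argsort()

# Step 2: reorder every array with those same positions
sorted_sample = {col: values[order] for col, values in sample.items()}
```

Every column has to be reordered by hand, and nothing stops us from forgetting one.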

## `Series`
The `Series` class provides a data structure for arrays of a single type with some additional functionality.

In [None]:
import pandas as pd

place = pd.Series(array_dict['place'], name='place')
place

Here are some commonly used attributes with `Series` objects:

|Attribute | Returns |
| --- | --- |
| `name` | The name of the `Series` object |
| `dtype` | The data type of the `Series` object |
| `shape` | Dimensions of the `Series` object in a tuple of the form `(number of rows,)` |
| `index` | The `Index` object that is part of the `Series` object |
| `values` | The data in the `Series` object |

For the most part, `pandas` objects use NumPy arrays for their internal data representations. However, for some data types, `pandas` builds upon NumPy to create its own [arrays](https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html). For this reason, depending on the data type, `values` can return either a `pandas` extension array or a NumPy `ndarray`. Therefore, if we need to ensure we get a specific type back, it is recommended to use the `array` attribute (for a `pandas` array) or the `to_numpy()` method (for a NumPy array) instead of `values`.
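
As a small illustration of the difference, the sketch below uses two made-up `Series` objects (one of floats, one categorical) that are not part of the earthquake data:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
floats = pd.Series([1.0, 2.5, 4.0], name='mag')
categories = pd.Series(['green', 'red', 'green'], dtype='category')

# values depends on the data type
float_values = floats.values          # a NumPy ndarray
category_values = categories.values   # a pandas Categorical

# to_numpy() and array give a predictable type
category_as_numpy = categories.to_numpy()  # NumPy ndarray
float_as_pandas = floats.array             # pandas extension array
```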

Now let's see some examples using these attributes.

### Getting the name of the series
The NumPy array held the name of the data in the `dtype` attribute; here, we can access it directly:

In [None]:
place.name

### Getting the data type
A `Series` object holds a single data type. Here it is `'O'` for object.

In [None]:
place.dtype

### Getting the dimensions of the series
Just as with NumPy, we can use `shape` to get the dimensions as `(rows, columns)`. `Series` objects are a single column, so they only have values for the rows dimension.

In [None]:
place.shape

### Isolating the values from the series
This `Series` object is storing its values as a NumPy array:

In [None]:
place.values

## `Index`
The addition of the `Index` class makes the `Series` class more powerful than a NumPy array. We can get the index from the `index` attribute of a `Series` object:

In [None]:
place_index = place.index
place_index

As with `Series` objects, we can access the underlying data via the `values` attribute. Note that this `Index` object is also built on top of a NumPy array:

In [None]:
place_index.values

Here are some commonly used attributes with `Index` objects:

|Attribute | Returns |
| --- | --- |
| `name` | The name of the `Index` object |
| `dtype` | The data type of the `Index` object |
| `shape` | Dimensions of the `Index` object |
| `values` | The data in the `Index` object |
| `is_unique` | Check if the `Index` object has all unique values |

We can check the type of the underlying data, just like with a `Series` object:

In [None]:
place_index.dtype

Same for the dimensions:

In [None]:
place_index.shape

We can check if the values are unique:

In [None]:
place_index.is_unique

With NumPy we can perform arithmetic operations element-wise between arrays:

In [None]:
np.array([1, 1, 1]) + np.array([-1, 0, 1])

Pandas supports this as well, but the index determines how element-wise operations are performed. With addition, only values with matching index labels are summed; labels that appear in only one of the objects get `NaN`:

In [None]:
numbers = np.linspace(0, 10, num=5) # makes numpy array([0, 2.5, 5, 7.5, 10])
x = pd.Series(numbers) # index is [0, 1, 2, 3, 4]
y = pd.Series(numbers, index=pd.Index([1, 2, 3, 4, 5]))
x + y
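
To see the alignment explicitly, the sketch below repeats the same setup and also shows one possible way to treat missing labels as zero, using the `fill_value` argument of `Series.add()`. Whether zero is an appropriate fill is an assumption that depends on the data:

```python
import numpy as np
import pandas as pd

numbers = np.linspace(0, 10, num=5)  # array([0, 2.5, 5, 7.5, 10])
x = pd.Series(numbers)               # index is [0, 1, 2, 3, 4]
y = pd.Series(numbers, index=pd.Index([1, 2, 3, 4, 5]))

# Labels 0 and 5 exist in only one Series, so their sums are NaN
aligned = x + y

# Treat a missing label as 0 instead of propagating NaN
filled = x.add(y, fill_value=0)
```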

We aren't limited to the integer indices of list-like structures; we can label our rows. The labels can be altered at any time and can be things like dates or even another column. In chapter 3, we will discuss how to perform some operations on the index in order to change it. Then, in chapter 4, we will use the index when merging and aggregating data.
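
As a quick sketch of relabeling, the example below replaces a default integer index with dates. The magnitudes and dates are made up for illustration:

```python
import pandas as pd

# Made-up magnitudes with the default integer index
mags = pd.Series([1.7, 5.2, 3.0], name='mag')

# Replace the labels with dates; the data itself is unchanged
mags.index = pd.to_datetime(['2018-10-11', '2018-10-12', '2018-10-13'])

# Rows can now be selected by date label
selected = mags.loc[pd.Timestamp('2018-10-12')]
```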

## `DataFrame`
Having a `Series` object for each column is an improvement over the NumPy representation; however, we still have the same problem when we want to sort based on a value or grab an entire row. The `DataFrame` gives us a representation of a table formed from many `Series` objects that form the columns and a shared `Index` object that labels the rows. We can create a `DataFrame` object from either of the NumPy representations we were working with earlier (we could also make a `Series` object for each column, but there is no need to do so):

In [None]:
df = pd.DataFrame(array_dict)

# this will also work with the first representation
# df = pd.DataFrame(data)

df
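
With a `DataFrame`, the two tasks that were cumbersome with NumPy (grabbing the full row for the maximum magnitude and sorting by magnitude) each take one line. The sketch below uses a small made-up dataframe, `sample_df`, rather than the earthquake data:

```python
import pandas as pd

# Made-up stand-in for the earthquake dataframe
sample_df = pd.DataFrame({
    'place': ['A', 'B', 'C'],
    'mag': [4.7, 1.2, 3.0],
    'tsunami': [1, 0, 0],
})

# Entire row for the maximum magnitude: idxmax() gives the row label
strongest = sample_df.loc[sample_df['mag'].idxmax()]

# All columns are reordered together when sorting by one of them
by_mag = sample_df.sort_values('mag')
```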

We can check the type of the underlying data with `dtypes` (note that it is not `dtype` as with `Series` and `Index` objects since each column will have its own data type):

In [None]:
df.dtypes

We can get the underlying data with the `values` attribute. Note that this looks very similar to our initial NumPy representation:

In [None]:
df.values

We can isolate the columns with the `columns` attribute. Notice that the columns are actually an `Index` object just on a different axis (columns are the horizontal index while rows are the vertical index).

In [None]:
df.columns

Here are some commonly used attributes:

|Attribute | Returns |
| --- | --- |
| `dtypes` | The data types of each column |
| `shape` | Dimensions of the `DataFrame` object in a tuple of the form `(number of rows, number of columns)` |
| `index` | The `Index` object along the rows of the `DataFrame` object |
| `columns` | The names of the columns (as an `Index` object) |
| `values` | The data in the `DataFrame` object |
| `empty` | Check if the `DataFrame` object is empty |

The `Index` object along the rows of the dataframe can be accessed via the `index` attribute (just as with `Series` objects):

In [None]:
df.index

As with both `Series` and `Index` objects, we can get the dimensions of the dataframe with the `shape` attribute. The result is of the form `(nrows, ncols)`. Our dataframe has 5 rows and 6 columns:

In [None]:
df.shape

Note that we can also perform arithmetic on dataframes. Pandas will only perform the operation when both the index and column match. Here, we demonstrate addition. Since addition with strings means concatenation, `pandas` concatenated the string columns (`time`, `place`, `magType`, and `alert`) across dataframes. The numeric columns (`mag` and `tsunami`) were summed:

In [None]:
df + df
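
Adding a dataframe to itself hides the alignment, since every label matches. The sketch below uses two small made-up dataframes that only partially overlap; the `depth` column is hypothetical and not part of the earthquake snippet:

```python
import pandas as pd

# Made-up dataframes that share only the mag column and the row label 1
left = pd.DataFrame({'mag': [1.0, 2.0], 'tsunami': [0, 1]}, index=[0, 1])
right = pd.DataFrame({'mag': [10.0, 20.0], 'depth': [5.0, 6.0]}, index=[1, 2])

# The result covers the union of rows and columns;
# only the cell at (row 1, column mag) has values on both sides
total = left + right
```

Every other cell in `total` is `NaN` because either the row label or the column name is missing from one of the operands.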
