
# Effective Pandas
#### Based on the book by Matt Harrison
### Learning pandas - Lessons and more

### Library import

In [2]:
import pandas as pd

### Data structures

The two most widely used pandas data structures are the Series and the DataFrame. A Series is a 1D structure for array-like data, and a DataFrame is a 2D structure for tabular data.
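As a rough illustration of the difference in dimensionality, the sketch below builds a small Series and a small DataFrame and inspects `ndim` and `shape`. The names `s` and `frame` and their values are made up for this example.

```python
import pandas as pd

# Hypothetical sample data, used only to compare the two structures
s = pd.Series([10, 20, 30], name='values')
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

print(s.ndim, s.shape)          # 1 (3,)
print(frame.ndim, frame.shape)  # 2 (3, 2)

# Selecting a single column of a DataFrame gives back a Series
col = frame['a']
print(type(col).__name__)       # Series
```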

In [3]:
# Making a DataFrame
df = pd.DataFrame(columns=['name','last_name','email'], index=[1,2,3])

In [4]:
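# Set the values for the row with index label 1.
# .loc avoids chained assignment (df['name'][1] = ...), which may not update df.
df.loc[1, 'name'] = "Franco"
df.loc[1, 'last_name'] = "Moreno"
df.loc[1, 'email'] = "franco@hello.com"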

In [5]:
# Read all the values from index 1
df.loc[1]

name                   Franco
last_name              Moreno
email        franco@hello.com
Name: 1, dtype: object


## Series Introduction

A Series is used to model one-dimensional data. Besides the values themselves, a Series object carries a few more bits of data, including an index and a name. A recurring idea throughout pandas is the notion of an axis. Because a Series is one-dimensional, it has a single axis: the index.

**The basics, modeled with a plain Python dictionary:**

In [6]:
# Modeling a series with a plain dictionary
series = {
  'index': [0, 1, 2, 3],
  'data': [145, 142, 38, 13],
  'name': 'songs'
}

In [7]:
# A function to look up a series element by its index label
def get(series, idx):
  value_idx = series['index'].index(idx)
  return series['data'][value_idx]

In [8]:
# Getting the element
get(series=series, idx=1)

142

In [9]:
# The same structure, this time with string labels as the index
songs = {
  'index': ['Paul', 'John', 'George', 'Ringo'],
  'data': [145, 142, 38, 13],
  'name': 'counts'
}

In [10]:
get(series=songs, idx='John')

142

**Now using a Pandas Series:**

In [11]:
# Making the Series
songs2 = pd.Series([145, 142, 38, 13], name='counts')

In [12]:
# Printing all details from songs2
songs2

0    145
1    142
2     38
3     13
Name: counts, dtype: int64


**Fact:** The output looks like a two-column (2D) table, but the index is not part of the data. We can confirm this by checking the shape, which is one-dimensional.

In [13]:
songs2.shape

(4,)
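Another way to see that the index is not counted as data, sketched here with the same values as `songs2`: `reset_index()` moves the index into an ordinary column, and only then does the result become two-dimensional.

```python
import pandas as pd

songs2 = pd.Series([145, 142, 38, 13], name='counts')

print(songs2.shape)                # (4,)
print(songs2.ndim)                 # 1

# Moving the index into a regular column yields a 2D DataFrame
as_frame = songs2.reset_index()
print(as_frame.shape)              # (4, 2)
```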

In [14]:
# Making songs3 with a custom string index
songs3 = pd.Series([145, 142, 38, 13], name='counts', index=['Paul', 'John', 'George', 'Ringo'])

In [15]:
# Index inspection
songs3.index

Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')
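With a string index in place, values can be looked up either by label or by position. The following is a small sketch using the same data as `songs3`; the variable names `by_label` and `by_position` are only for illustration.

```python
import pandas as pd

songs3 = pd.Series([145, 142, 38, 13], name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

by_label = songs3.loc['John']     # lookup by index label
by_position = songs3.iloc[1]      # lookup by integer position
print(by_label, by_position)      # 142 142

# The index can also be converted to a plain Python list
print(songs3.index.tolist())      # ['Paul', 'John', 'George', 'Ringo']
```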

#### Storing a class instance in a Series - Heterogeneous or mixed types

In [16]:
# Storing a class instance in a Series alongside strings and an integer
class Foo:
  pass

ringo = pd.Series(
  ['Richard', 'Starkey', 13, Foo()],
  name='ringo'
)

In [17]:
# Getting the ringo Series
ringo

0                                    Richard
1                                    Starkey
2                                         13
3    <__main__.Foo object at 0x7c98fa8b6ef0>
Name: ringo, dtype: object
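A quick way to check what pandas does with mixed values is to inspect the dtype and the type of each element. This is a minimal sketch that rebuilds the same `ringo` Series: the dtype falls back to `object`, and each element keeps its original Python type.

```python
import pandas as pd

class Foo:
    pass

ringo = pd.Series(['Richard', 'Starkey', 13, Foo()], name='ringo')

print(ringo.dtype)  # object

# Each element keeps its original Python type
element_types = [type(v).__name__ for v in ringo]
print(element_types)  # ['str', 'str', 'int', 'Foo']
```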


### The NaN value

When pandas determines that a Series holds numeric values but cannot find a number to represent an entry, it uses **NaN** (Not a Number).

In [18]:
import numpy as np

# a Series
nan_series = pd.Series([2, np.nan])

In [19]:
# Getting the nan_series
nan_series

0    2.0
1    NaN
dtype: float64


**Fact:** The dtype is float64 because float64 can represent NaN, whereas int64 cannot.

**Fact 2:** count() ignores NaN values, so it returns only the number of non-missing entries.

In [20]:
# Printing the count of nan_series
nan_series.count()

1
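The two facts above can be checked side by side. This short sketch rebuilds `nan_series` and compares `size` (all entries) with `count()` (non-missing entries only).

```python
import numpy as np
import pandas as pd

nan_series = pd.Series([2, np.nan])

print(nan_series.dtype)            # float64
print(nan_series.size)             # 2 (all entries, including NaN)
print(nan_series.count())          # 1 (non-missing entries only)
print(nan_series.isna().tolist())  # [False, True]
```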

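#### Optional Integer Support for NaN - p. 17

With the nullable Int64 dtype (capital "I"), a missing entry can be passed as None instead of NaN. pandas stores it as the NA value (pd.NA) and the remaining values stay integers.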

In [21]:
# Creating a Series sample
nan_series2 = pd.Series([2, None],
                        index=['Ono', 'Clapton'],
                        dtype='Int64')

In [22]:
# printing nan_series2
nan_series2

Ono           2
Clapton    <NA>
dtype: Int64


**Fact:** NA is also ignored by operations like count().

In [23]:
# showing count()
nan_series2.count()

1
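To make the behaviour of the nullable dtype more concrete, the sketch below rebuilds `nan_series2` and checks that the missing entry is `pd.NA` and that aggregations skip it by default.

```python
import pandas as pd

nan_series2 = pd.Series([2, None],
                        index=['Ono', 'Clapton'],
                        dtype='Int64')

print(nan_series2.dtype)                     # Int64
print(nan_series2.loc['Clapton'] is pd.NA)   # True
print(nan_series2.sum())                     # 2 (NA is skipped by default)
```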

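**Fact 2:** The dtype can also be converted with astype(). Here nan_series2 is already Int64, so the values do not change. astype() returns a new Series and leaves the original untouched, which is why printing nan_series2 again gives the same result.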

In [24]:
# Changing the data type
nan_series2.astype('Int64')

Ono           2
Clapton    <NA>
dtype: Int64


In [25]:
# Printing nan_series2 again
nan_series2

Ono           2
Clapton    <NA>
dtype: Int64
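The conversion is easier to see when the starting dtype differs from the target. As a sketch, the example below converts a float64 Series containing NaN to Int64; the names `float_series` and `int_series` are made up for this illustration.

```python
import numpy as np
import pandas as pd

float_series = pd.Series([2, np.nan])        # dtype float64
int_series = float_series.astype('Int64')    # new Series, dtype Int64

print(float_series.dtype, int_series.dtype)  # float64 Int64
print(int_series.isna().tolist())            # [False, True]
```

The original `float_series` keeps its float64 dtype because `astype()` does not modify the Series in place.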


### Similar to NumPy - p. 17

Both the pandas Series and the NumPy array respond to index operations:

In [26]:
import numpy as np

numpy_series = np.array([145, 142, 38, 13])

In [27]:
# Getting the element at position 1 of numpy_series
numpy_series[1]

142

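In [30]:
# In recent pandas versions, Series[int] is deprecated for positional access
# when the index is not integer-based. Use .iloc for positions and .loc for labels.
# Getting the element at position 1 of the pandas Series
songs3.iloc[1]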

142

In [33]:
# The same mean() method exists in pandas
songs3.mean()

84.5

In [34]:
# Mean in numpy
numpy_series.mean()

84.5
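The parallels above can be put in one place. This is a minimal sketch using the same values as `numpy_series` and `songs3`; it also shows that a Series can hand its values back as a NumPy array via `to_numpy()`.

```python
import numpy as np
import pandas as pd

values = [145, 142, 38, 13]
numpy_series = np.array(values)
songs3 = pd.Series(values, name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

# Same positional access and the same aggregate result
print(numpy_series[1], songs3.iloc[1])        # 142 142
print(numpy_series.mean(), songs3.mean())     # 84.5 84.5

# A Series can return its values as a NumPy array
as_array = songs3.to_numpy()
print(as_array.tolist())                      # [145, 142, 38, 13]
```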

#### Making a mask

Both also have a notion of a boolean array. In pandas, a boolean array is a Series of True/False values that shares its index with the Series you are working with. It can be used as a mask that keeps only the items where the value is True. Plain Python lists do not support this kind of fancy indexing, such as passing a list inside an index operation.

In [37]:
# Compare each value with the median: True if it exceeds the median, otherwise False
mask = songs3 > songs3.median()

In [38]:
mask

Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool
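Once the mask exists, it can be passed inside an index operation to filter the Series. The sketch below rebuilds `songs3`, applies the mask, and shows the equivalent filtering on a NumPy array; the name `filtered` is only for illustration.

```python
import numpy as np
import pandas as pd

songs3 = pd.Series([145, 142, 38, 13], name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

mask = songs3 > songs3.median()   # the median is 90.0
filtered = songs3[mask]           # keeps only the entries where mask is True
print(filtered.to_dict())         # {'Paul': 145, 'John': 142}

# The same idea with a NumPy array
numpy_series = songs3.to_numpy()
above_median = numpy_series[numpy_series > np.median(numpy_series)]
print(above_median.tolist())      # [145, 142]
```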
