# Data Structures

One of the keys to understanding pandas is to understand its data model. At the core of pandas are two data structures: the Series, for one-dimensional array data, and the DataFrame, for two-dimensional tabular data. The table below shows their analogs in the spreadsheet and database worlds.


| Data Structure| Dimensionality| Spreadsheet Analog| Database Analog|
|-|-|-|-|
|Series|1D|Column|Column|
|DataFrame|2D|Single Sheet|Table|

A DataFrame is similar to a sheet with rows and columns, while a Series is similar to a single column
of data (when we refer to a column of data in this text, we are referring to a Series).



## NumPy

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

At the core of the NumPy package is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. There are several important differences between NumPy arrays and the standard Python sequences:

- NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.

- The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.

- NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.
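
The first two differences listed above can be seen in a minimal sketch. The variable names are illustrative, and only `np.array`, `np.append`, and the `dtype` attribute are used:

```python
import numpy as np

# Appending does not grow the array in place: np.append returns a new array.
small = np.array([1, 2, 3])
bigger = np.append(small, 4)
print(small.shape, bigger.shape)

# Mixing integers and floats upcasts every element to a single dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# dtype=object is the exception: the elements are arbitrary Python objects.
objs = np.array([1, "two", 3.0], dtype=object)
print(objs.dtype)
```

The original `small` array keeps its three elements; `bigger` is a separate array with four.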

A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays. In other words, in order to efficiently use much (perhaps even most) of today’s scientific/mathematical Python-based software, just knowing how to use Python’s built-in sequence types is insufficient - one also needs to know how to use NumPy arrays.

In [1]:
!pip install numpy






In [2]:
# https://numpy.org/doc/stable/reference/
import numpy as np

In [7]:
digits = np.array(range(10))
digits

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]:
# notice the "See Also" section
np.array

<function numpy.array>

In [14]:
# secret of numpy (there are not 10 Python integers under the array)
digits.dtype

dtype('int64')

In [15]:
# log(0) is undefined: the first element is -inf and NumPy emits a RuntimeWarning
np.log(digits)


array([      -inf, 0.        , 0.69314718, 1.09861229, 1.38629436,
       1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458])

In [16]:
digits

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [17]:
digits + 1

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [18]:
np.log(digits+1)

array([0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791,
       1.79175947, 1.94591015, 2.07944154, 2.19722458, 2.30258509])

In [19]:
np.sin(digits)

array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ,
       -0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849])

In [20]:
len(dir(np))

530

In [21]:
digits

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [22]:
len(dir(digits))

168

In [None]:
digits.mean()

In [None]:
digits + 10

In [None]:
digits * 1000

In [None]:
np.arange(100).reshape(20, 5)

In [None]:
# 2d
nums = np.arange(100).reshape(20, 5)
nums

In [None]:
nums.transpose()

In [None]:
nums.mean()

In [None]:
nums.mean(axis=0)

In [None]:
nums.mean(axis=1)

In [None]:
nums.mean(axis=1, keepdims=True)

In [None]:
#3d
b = np.arange(70).reshape(7,5,2)
b

In [None]:
b.mean(axis=0)

In [None]:
b.mean(axis=1)

In [None]:
b.mean(axis=2)

## Why is NumPy Fast?

Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:

- vectorized code is more concise and easier to read

- fewer lines of code generally means fewer bugs

- the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)

- vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.
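
To make the contrast concrete, here is a small sketch comparing an explicit loop with its vectorized equivalent. The array and variable names are illustrative only:

```python
import numpy as np

values = np.arange(10)

# Explicit Python loop: one multiplication per iteration.
squares_loop = []
for v in values:
    squares_loop.append(v * v)

# Vectorized form: the loop runs inside NumPy's compiled code.
squares_vec = values * values

print(squares_vec)
```

Both approaches produce the same numbers; the vectorized line simply leaves the looping to NumPy.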

Broadcasting is the term used to describe the implicit element-by-element behavior of operations; generally speaking, in NumPy all operations, not just arithmetic operations, but logical, bit-wise, functional, etc., behave in this implicit element-by-element fashion, i.e., they broadcast. Moreover, the two operands of such an operation can be multidimensional arrays of the same shape, a scalar and an array, or even two arrays with different shapes, provided that the smaller array is “expandable” to the shape of the larger in such a way that the resulting broadcast is unambiguous.
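
The three broadcasting cases described above (scalar with array, and arrays of different but compatible shapes) can be sketched as follows. The arrays `grid`, `row`, and `col` are made up for illustration:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
col = np.array([[100], [200]])      # shape (2, 1)

# Scalar and array: the scalar is stretched to every element.
print(grid + 1)

# (2, 3) + (3,): the row is repeated down the rows.
print(grid + row)

# (2, 3) + (2, 1): the column is repeated across the columns.
print(grid + col)
```

Shapes that cannot be stretched to match, such as `(2, 3)` with `(2,)`, raise a `ValueError` instead of broadcasting.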

## Slicing

In [None]:
nums

In [None]:
nums[0]

In [None]:
nums[[0,5,10]]

In [None]:
# first ten rows
nums[0:10]

In [None]:
# first three columns with all rows
nums[:,0:3]

In [None]:
# Select All columns for 15th row
nums[14]

In [None]:
# Select the third and fifth columns for the 9th row
nums[8, [2, 4]]

## Boolean arrays

In [None]:
nums

In [None]:
nums % 2

In [None]:
nums % 2 == 0

In [None]:
nums[nums % 2 == 0]

In [None]:
(nums[nums % 2 == 0]).shape

In [None]:
# select rows where sum is less than 100
# Row-wise sums: axis=1 adds across the columns of each row
nums.sum(axis=1)

In [None]:
nums.sum(axis=1) < 100

In [None]:
nums[nums.sum(axis=1)< 100]

In [None]:
# select columns where mean > 50
nums.mean(axis=0)

In [None]:
nums.mean(axis=0) > 50

In [None]:
nums[:, nums.mean(axis=0) > 50]

## Universal Functions

In [None]:
nums + 5

In [None]:
nums

In [None]:
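# Add 5 only where the value is even; `out` supplies the values kept where the mask is False
np.add(nums, 5, out=nums.copy(), where=(nums % 2 == 0))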

In [None]:
np.add?

In [None]:
np.mod(nums, 5)  # same as nums % 5

In [None]:
np.log(nums)

In [None]:
# trig
np.sin(nums)

In [None]:
nums

In [None]:
nums[nums % 2 == 0]

In [None]:
# logic: keep values that are nonzero and even (nonzero entries of nums count as True)
nums[np.logical_and(nums, nums % 2 == 0)]

In [None]:
np.logical_and?

In [None]:
nums[nums > 10]

In [None]:
# comparison
nums[np.greater(nums, 10)]

In [None]:
# floating point
np.ceil(nums/3)

In [None]:
np.ceil?


## Pandas

In pandas, the two-dimensional counterpart to the one-dimensional Series is the DataFrame. If we want to understand this data structure, it helps to know how it is constructed. This chapter will
introduce the dataframe.


Dataframes can be created from many types of input:
- columns (dicts of lists)
- rows (list of dicts)
- CSV, xlsx files (pd.read_csv, pd.read_excel)
- NumPy ndarrays
- other: SQL, HDF5, arrow, etc
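
As a minimal sketch of the first, second, and fourth input types in the list above, the same small table can be built in several ways. The column names and values here are invented for illustration:

```python
import numpy as np
import pandas as pd

# Columns: a dict mapping column names to lists of values.
from_columns = pd.DataFrame({"item": ["pen", "mug"], "amount": [2.5, 8.0]})

# Rows: a list of dicts, one dict per row.
from_rows = pd.DataFrame([
    {"item": "pen", "amount": 2.5},
    {"item": "mug", "amount": 8.0},
])

# NumPy ndarray: column names are supplied separately.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

print(from_columns.equals(from_rows))
print(from_array.shape)
```

The column-oriented and row-oriented constructions describe the same data, so they yield equal dataframes.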

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('transaction_data.csv')
df

In [None]:
type(df['Amount'])

> Just as an Excel table is made up of columns, a DataFrame is made up of Series objects.

# Pandas Series



## Operator Methods

A Series is used to model one-dimensional data. Besides its values, the Series object carries a few more bits of data,
including an index and a name. A common idea throughout pandas is the notion of an axis. Because
a series is one-dimensional, it has a single axis: the index.
Below is a table of counts of songs that artists composed:


||Data|
|-|-|
|0|145|
|1|142|
|2|38|
|3|13|

In [None]:
# Create a Series
songs = pd.Series([145, 142, 38, 13], name= 'counts')

In [None]:
# Preview the series
songs

In [None]:
# Series Indexing
songs[3]

In [None]:
songs > 50

In [None]:
songs[(songs > 50)]

In [None]:
# Boolean Indexing
mask = songs > songs.median()

mask

In [None]:
songs[mask]

### Numerical Operations

In [None]:
# New songs

# Create a Series
new_songs = pd.Series([54, 83, 26, 90], name= 'counts')

In [None]:
new_songs

In [None]:
songs

In [None]:
# Adding 2 series

new_songs + songs

In [None]:
other_songs = pd.Series([54, 83, 26, 90], name= 'new_counts', index=[3,2,1,0])

other_songs

In [None]:
new_songs

In [None]:
other_songs + new_songs

In [None]:
# Working with Null Values
another_song_count = pd.Series([1, 10, 60, np.nan], name='other_count')

another_song_count

In [None]:
new_songs.add(another_song_count, fill_value=0)

In [None]:
new_songs.add(songs)

In [None]:
new_songs.add?

In [None]:
# Chaining

(
    songs
    # Add the new songs
    .add(new_songs)
)


|Method|Operator|Description|
|-|-|-|
|s.add(s2) | s + s2 | Adds series|
|s.radd(s2) | s2 + s | Adds series|
|s.sub(s2) | s - s2 | Subtracts series|
|s.rsub(s2) | s2 - s | Subtracts series|
|s.mul(s2), s.multiply(s2) | s * s2 | Multiplies series|
|s.rmul(s2) | s2 * s | Multiplies series|
|s.div(s2), s.truediv(s2) | s / s2 | Divides series|
|s.mod(s2) | s % s2 | Modulo of series division|
|s.eq(s2) | s == s2 | Elementwise equals of series|
|s.ne(s2) | s != s2 | Elementwise not equals of series|
|s.gt(s2) | s > s2 | Elementwise greater than of series|
|s.ge(s2) | s >= s2 | Elementwise greater than or equals of series|
|s.lt(s2) | s < s2 | Elementwise less than of series|
|s.le(s2) | s <= s2 | Elementwise less than or equals of series|

### Aggregate Methods

Aggregate methods collapse the values of a series down to a scalar. Aggregations are the numbers that your boss wants reported. If you worked at an ice cream joint and the boss came in and asked how the restaurant was doing, you wouldn’t answer, “Portia ordered a burger and fries. Mawuli ordered a cheeseburger and shake. Tom ordered ...”.
Your boss doesn’t care about that level of detail. They care about:

- How many people came in (count)
- How much food was ordered (count)
- What was the total revenue (sum)
- When did people come (skew)
- What was the average purchase amount (mean)


> Aggregations allow you to take detailed data and collapse it to a single value.
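
A few of the aggregations listed above can be sketched on a made-up series of purchase amounts. The `orders` data is hypothetical and not part of the transaction dataset:

```python
import pandas as pd

# Hypothetical purchase amounts; None marks an order with no recorded amount.
orders = pd.Series([12.0, 8.5, None, 20.0, 8.5], name="amount")

n_orders = orders.count()   # number of non-missing entries
revenue = orders.sum()      # total revenue (missing values are skipped)
average = orders.mean()     # average purchase amount

print(n_orders, revenue, average)
```

Each call reduces the whole series to one number, which is the level of detail the boss is asking for.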

In [None]:
songs

In [None]:
songs.quantile()

In [None]:
(145+142+38) / 3

In [None]:
(
    songs
    # Greater than 25
    .gt(25)
    # Find the mean
    .mean()
)

In [None]:
# Find the mean of songs greater than 25

# Step 1: select the songs > 25
# Step 2: find their mean

(songs[songs > 25]).mean()

In [None]:
# Percentage
(
    songs
    # Greater than 25
    .gt(25)
    # Multiply by 100 to express the result as a percentage
    .mul(100)
    # Find the mean
    .mean()
)

|Method or attribute|Description|
|-|-|
|'mad' |Return the mean absolute deviation (removed in pandas 2.0).|
|'max' |Return the maximum value.|
|'mean'| Return the mean value.|
|'median'| Return the median value.|
|'min' |Return the minimum value.|
|'nbytes' |Return the number of bytes of the data (attribute).|
|'ndim' |Return the number of dimensions (1) of the data (attribute).|
|'nunique' |Return the count of unique values.|
|'quantile' |Return the value at quantile q (default 0.5, the median).|
|'sem' |Return the unbiased standard error of the mean.|
|'size' |Return the size of the data (attribute).|
|'skew' |Return the unbiased skew of the data. A negative value indicates the tail is on the left side.|
|'std' |Return the standard deviation of the data.|
|'sum' |Return the sum of the series.|

With the series defined below:

```{python}
ser = df['Amount']
```
Find the following:
1. Find the count of non-missing values of a series.
2. Find the number of entries of a series.
3. Find the number of unique entries of a series.
4. Find the mean value of a series.
5. Find the maximum value of a series.

In [None]:
ser = df['Amount']

ser

### Indexing 


Let’s shift the focus onto pulling data out by using indexing operators. You can index directly on
a series object, but this is not recommended. Use `.loc` or `.iloc` to index instead.


The `.loc` attribute deals with index labels. It allows you to pull out pieces of the series. You can
pass in the following into an index operation on .loc:
- A scalar value of one of the index labels
- A list of index labels.
- A slice of labels (closed interval so it includes the stop value).
- An index.
- A boolean array (same index labels as the series, but with True or False values).
- A function that accepts a series and returns one of the above.
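
Most of the inputs listed above can be tried on a small, made-up series. The labels and the name `scores` are illustrative only:

```python
import pandas as pd

# Hypothetical series with unique string labels.
scores = pd.Series([78, 165, 744, 475], index=["a", "b", "c", "d"], name="count")

one_value = scores.loc["b"]                           # scalar label
picked = scores.loc[["a", "d"]]                       # list of labels
sliced = scores.loc["b":"c"]                          # label slice, stop label included
masked = scores.loc[scores.gt(100)]                   # boolean array
computed = scores.loc[lambda s: s.gt(s.median())]     # function returning a boolean array

print(one_value)
print(sliced)
print(computed)
```

Note that the label slice `"b":"c"` includes `"c"`, unlike a positional slice.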

In [None]:
ser = pd.Series([78, 165, 744, 475, 214, 367], name='count', index=['Glen', 'Mawuli', 'Portia', 'Racheal', 'Portia', 'Mavis'])

ser

In [None]:
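# Plain indexing with an integer falls back to position here; recent pandas versions warn about this, so prefer ser.iloc[0]
ser[0]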

In [None]:
ser.loc['Racheal']

In [None]:
ser.loc[['Glen', 'Racheal', 'Mavis']]

In [None]:
# A unique label returns a scalar
type(ser.loc['Glen'])

In [None]:
# A duplicated label ('Portia' appears twice) returns a Series
type(ser.loc['Portia'])

In [None]:
ser.gt(100)

In [None]:
ser.loc[(ser.gt(100))]

> If we want to return a series object, we can index it with a list of positions. This can be a list with a single position in it or multiple positions. For example, `ser.iloc[[0, 1, 5]]` returns a series with the first, second, and last values.

In [None]:
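# The songs series has the default integer index
songs

In [None]:
# .loc slices by label and includes the stop label (4 values)
songs.loc[0:3]

In [None]:
# .iloc slices by position and excludes the stop position (3 values)
songs.iloc[0:3]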

In [None]:
# Using iloc

ser.iloc[[0,1,5]]

> We can also use slices with .iloc. In this case, slices behave as they do in Python lists and follow
the half-open interval. That is, they include the first index and go up to but do not include the last
index. If we want to return the first five items, we can use the .head method or the following code,
which takes index positions starting at 0 and includes 1, 2, 3, and 4, but does not include 5:

In [None]:
ser.iloc[0:5]

### Head and Tail

The `.head` and `.tail` methods are useful for pulling out values at the start or end of the series,
respectively. These methods are used to quickly inspect a chunk of the data.

In [None]:
ser.head(3)

In [None]:
ser.tail(2)

In [None]:
ser_2 = df['Amount']


ser_2.head()

# Plotting with a Series 

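Inspecting statistical summaries and tables can reveal much about your data. Another technique
to understand the data at a more intuitive level is to plot it.

> Plotting in pandas requires matplotlib; please install it (for example with `pip install matplotlib`) before running the cells below.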


## The .plot Attribute
A series object has a `.plot` attribute. This attribute is interesting because you can call it directly to create
plots or access sub-attributes of it, such as `.plot.hist` and `.plot.box`.

In [None]:
amount_ser = df['Amount']


amount_ser.head()

## Histograms

If you have continuous numeric data, plotting a histogram can give you insight into how the data 
is distributed:

In [None]:
(
    amount_ser
    # Plot a Histogram
    .plot.hist()
)

In [None]:
# Plotting without the outliers
(
    amount_ser
    # Keep only amounts below 40
    [amount_ser < 40]
    # Plot a histogram
    .plot.hist(bins=15)
)

## Box Plot

You can also create a boxplot to view the distribution of the data.

In [None]:
(
    amount_ser
    # Creating a boxplot
    .plot.box()
)

In [None]:
(
    amount_ser
    # Selecting values < 40
    [amount_ser < 40]
    # Creating a boxplot
    .plot.box()
)

In [None]:
(
    amount_ser
    # Pull out your quantile values
    .quantile([0.25, 0.5, 0.75])
)

## Pandas

In [None]:
# df was loaded earlier with pd.read_csv('transaction_data.csv')
df

In [None]:
# Preview the dataset
df.head()

## Math Methods in Dataframes

|Method |Description|
|-|-|
|.add(other, axis='columns', level=None, fill_value=None) |Add other to dataframe across axis. Unlike operator, can specify fill_value.|
|.sub(other, axis='columns', level=None, fill_value=None) |Subtract other from dataframe across axis. Unlike operator, can specify fill_value.|
|.mul(other, axis='columns', level=None, fill_value=None) |Multiply other with dataframe across axis. Unlike operator, can specify fill_value.|
|.div(other, axis='columns', level=None, fill_value=None) |Divide dataframe by other across axis. Unlike operator, can specify fill_value.|
|.truediv(other, axis='columns', level=None, fill_value=None) |Same as .div.|
|.floordiv(other, axis='columns', level=None, fill_value=None) |Integer divide dataframe by other across axis. Unlike operator, can specify fill_value.|
|.mod(other, axis='columns', level=None, fill_value=None) |Perform modulo operation with other across axis. Unlike operator, can specify fill_value.|
|.pow(other, axis='columns', level=None, fill_value=None) |Raise to other power across axis. Unlike operator, can specify fill_value.|

In [None]:
# Compute half of each amount (the result is not yet stored in df)
df['Amount'] * 0.5

In [None]:
df['Amount'].mul(0.5)

In [None]:
# Sales Tax
df['Amount'] * 0.2

In [None]:
df['Amount'].mul(0.2)

In [None]:
# Total amount paid: the amount plus 25%
df['Amount'].add(df['Amount'].mul(0.25) )

In [None]:
# Assign the new col to the df
df['half_price'] = df['Amount'] * 0.5

In [None]:
df['disc_amount'] = df['Amount'].mul(0.3)

In [None]:
df.head()

In [None]:
# Using the assign method

df = (
    df
    # Creating a new col called sales_tax
    .assign(sales_tax=df['Amount'].mul(0.15))
)

In [None]:
df.head()

## Aggregations

The aggregations that are found in a series are also applicable to a dataframe. You need to keep in
mind that a dataframe has two dimensions. This means you can aggregate across both dimensions:
you can sum along axis 0 (the index) or axis 1 (the columns). For example, to calculate the average of each row, isolate the numeric columns using `.loc`, sum along
the columns, and divide the result by the number of columns.
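
The row-average recipe described above can be sketched on a small, hypothetical dataframe. The `sales` frame and its column names are invented for illustration and are not part of the transaction dataset:

```python
import pandas as pd

# Hypothetical dataframe; the column names are illustrative.
sales = pd.DataFrame({
    "item": ["pen", "mug", "cap"],
    "q1": [10, 20, 30],
    "q2": [20, 40, 60],
})

numeric_cols = ["q1", "q2"]
row_avg = (
    sales
    # Isolate the numeric columns
    .loc[:, numeric_cols]
    # Sum along the columns (one total per row)
    .sum(axis=1)
    # Divide by the number of columns
    .div(len(numeric_cols))
)
print(row_avg)
```

Calling `.mean(axis=1)` on the same numeric columns gives the same result in one step.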

In [None]:
df.head()

In [None]:
# Using the .describe method

df.describe()

In [None]:
(
    df['Amount']
    .sum()
)

In [None]:
(
    df['sales_tax']
    # Find the max sales tax
    .max()
)

In [None]:
# Grouping in pandas
(
    df
    # Lets group by Traffic source
    .groupby(by='Traffic_Source')
    # Let's find avg amount spent per traffic source
    ['Amount'].mean()
)

In [None]:
# Grouping in pandas
(
    df
    # Lets group by Traffic source
    .groupby(by='Traffic_Source')
    # Let's find the total amount spent per traffic source
    ['Amount'].sum()
    # Sort Values
    .sort_values(ascending=False)
)

In [None]:
(
    df
    # Group by item description
    .groupby(by='Item_Description')
    # Sum of the amount paid
    ['Amount'].sum()
    # Sort
    .sort_values(ascending=False)
)

## Sorting Columns

The `.sort_values` method lets you sort the rows of a dataframe by one or more columns.
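
As a brief sketch of sorting by a single column and by several columns, consider a made-up `orders` frame. The data and column names are illustrative:

```python
import pandas as pd

# Hypothetical dataframe; the column names are illustrative.
orders = pd.DataFrame({
    "item": ["mug", "pen", "cap", "pen"],
    "amount": [8.0, 2.5, 12.0, 3.0],
})

# Sort rows by one column, largest amount first.
by_amount = orders.sort_values(by="amount", ascending=False)

# Sort by item name, then by amount within each item.
by_item_then_amount = orders.sort_values(by=["item", "amount"])

print(by_amount)
print(by_item_then_amount)
```

`.sort_values` returns a new dataframe and leaves `orders` unchanged.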

In [None]:
(
    df['Item_Description']
    # Count how many times each item description appears
    .value_counts()
)

## Dataframe Indexing, Filtering




In [None]:
df.head()

In [None]:
# How do we select a column

df['sales_tax']

In [None]:
# How do I select 2 or more columns
df[['sales_tax', 'half_price']]

In [None]:
# Select three columns
df[['sales_tax', 'Amount', 'Traffic_Source']]

## Using `.loc` for filtering

In [None]:
df['Amount'] < 40

In [None]:
# Select records with amount < 40

df.loc[(df['Amount'] < 40)]

In [None]:
# Selecting amounts  == 6

mask = df['Amount'] == 6


df.loc[mask]

In [None]:
# Select the item description and traffic source for items with amount == 6

df.loc[(df['Amount'] == 6), ['Item_Description', 'Traffic_Source', 'Amount']]

In [None]:
mask = df['Traffic_Source'] == 'Paid Search'


df.loc[(mask), ['Amount', 'sales_tax']]

In [None]:
# Select records for Paid Search where amount == 6
mask = (df['Traffic_Source'] == 'Paid Search') & (df['Amount'] == 6)


df.loc[mask]

In [None]:
# Select records for Paid Search where amount == 6 and item_description == Lanyard
mask = (df['Traffic_Source'] == 'Paid Search') & (df['Amount'] == 6) & (df['Item_Description'] == 'Lanyard')


df.loc[mask]

In [None]:
# Select records for Paid Search where amount == 6 and item_description == Lanyard
mask = (df['Traffic_Source'] == 'Paid Search') & (df['Amount'] == 6) & (df['Item_Description'] == 'Lanyard')


df.loc[mask, ['Traffic_Source', 'Amount', 'Item_Description']]

In [None]:
# Select records for Paid Search where amount == 6 and item_description == Lanyard
mask = (df['Traffic_Source'] == 'Paid Search') & (df['Amount'] == 6) & (df['Item_Description'] == 'Lanyard')


# Sum of the amount
(df.loc[mask, 'Amount']).sum()

## Plotting with Dataframes

Pandas has an integration with matplotlib. This integration makes it easy to 
create various plots if you understand what type of plot you want.

### Bar Plots



In [None]:
df = pd.read_csv('transaction_data.csv')

# Preview the dataset
df.head()

In [None]:
(
    df
    # Group the dataframe by traffic_source
    .groupby(by='Traffic_Source')
    # Aggregate by the sum of Amount
    ['Amount'].sum()
    # sort values in descending order
    .sort_values(ascending=False)
    # Plot your bar plot
    .plot.bar(rot=45)
)

In [None]:
(
    df
    # Group the dataframe by traffic_source
    .groupby(by='Traffic_Source')
    # Aggregate by the sum of Amount
    ['Amount'].sum()
    # sort values in ascending order
    .sort_values(ascending=True)
    # Plot your bar plot
    .plot.barh(rot=45)
)

In [None]:
(
    df
    # Group the dataframe by Item_Description
    .groupby(by='Item_Description')
    # Aggregate by the sum of Amount
    ['Amount'].sum()
    # sort values in ascending order
    .sort_values(ascending=True)
    # Plot your bar plot
    .plot.barh(rot=45)
)

### Scatterplots

A scatter plot is useful to determine the relationship between two columns that are numeric. We
can evaluate what tends to happen to one value as the other value changes.

In [None]:
(
    df
    # Filter for only amount < 20
    [df['Amount'] < 20]
    .plot.scatter(x='Amount', y='Session_Duration')
)