# Pandas - DataFrame and Series
Pandas is a powerful data manipulation library in Python, widely used for data analysis and data cleaning. It provides two primary data structures: Series and DataFrame. A Series is a one-dimensional array-like object, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

## What is a Pandas DataFrame?
A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table, and is one of the most commonly used data structures for data analysis in Python.

Key features of a DataFrame:
- Each column can have a different data type (integer, float, string, etc.).
- Columns and rows are labeled, making data easy to access and manipulate.
- Supports a wide range of operations for filtering, grouping, merging, reshaping, and summarizing data.
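- Represents missing values explicitly (as `NaN` or `None`) and provides methods such as `isna()`, `dropna()`, and `fillna()` to find, remove, or replace them.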

DataFrames are ideal for working with real-world data, which often comes in tabular form and may contain different types of information in each column.
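As a rough illustration of the features listed above, the sketch below filters, groups, and fills a missing value. The `orders` table and its column names are made up for this example and do not come from any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one 'units' value is missing (NaN).
orders = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [2, 1, np.nan, 4],
})

# Count the missing values in the 'units' column.
missing_count = orders["units"].isna().sum()

# Replace missing 'units' with 0; fillna returns a new DataFrame.
filled = orders.fillna({"units": 0})

# Filtering: keep only rows with at least 2 units.
large_orders = filled[filled["units"] >= 2]

# Grouping and summarizing: total units per region.
units_by_region = filled.groupby("region")["units"].sum()

print(missing_count)
print(large_orders)
print(units_by_region)
```

Filling with 0 is only one possible policy; depending on the analysis, dropping the row with `dropna()` may be more appropriate.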

Heterogeneous tabular data refers to a table (like a spreadsheet or DataFrame) where different columns can contain different types of data. For example, one column might have integers, another might have floating-point numbers, and another might have strings or dates. This is in contrast to homogeneous data structures (like NumPy arrays), where all elements must be of the same type. Pandas DataFrames are designed to handle heterogeneous tabular data, making them very flexible for real-world datasets.
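The contrast between homogeneous and heterogeneous containers can be seen in a small sketch. The values and column names here are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype for all elements, so mixing numbers and
# text forces everything into a common (string) type.
mixed_array = np.array([1, 2.5, "three"])
print(mixed_array.dtype)

# A DataFrame stores a separate dtype for each column.
mixed_frame = pd.DataFrame({
    "count": [1, 2, 3],
    "price": [9.99, 19.5, 4.0],
    "label": ["a", "b", "c"],
})
print(mixed_frame.dtypes)
```

Note that a single DataFrame column is still homogeneous; the heterogeneity is across columns.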

### Make sure to have pandas installed
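One way to check from Python whether pandas is available is sketched below. The install command in the message assumes a pip-based environment; conda or another package manager may be the right tool on your machine.

```python
import importlib.util

# find_spec returns None when the package cannot be found.
pandas_available = importlib.util.find_spec("pandas") is not None

if pandas_available:
    import pandas as pd
    print("pandas version:", pd.__version__)
else:
    print("pandas is not installed; try: python -m pip install pandas")
```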

In [1]:
import pandas as pd

In [None]:
# Series
# A pandas Series is a one-dimensional array-like object that can hold any data type.
# It is similar to a column in a spreadsheet or a database table.

data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print("Series:\n", series)
# In the output, the left column is the index and the right column holds the values.

Series:
 0    1
1    2
2    3
3    4
4    5
dtype: int64


In [None]:
# Create a Series from dictionary
data_dict = {'a': 1, 'b': 2, 'c': 3}
series_from_dict = pd.Series(data_dict)
print("Series from dictionary:\n", series_from_dict)

Series from dictionary:
 a    1
b    2
c    3
dtype: int64


In [21]:
# You can give the Series a custom index
data = [1, 2, 3, 4, 5]
index = ['one', 'two', 'three', 'four', 'five']
custom_index = pd.Series(data, index=index)  # pass the labels by keyword for clarity
print("Series with custom index:\n", custom_index)

Series with custom index:
 one      1
two      2
three    3
four     4
five     5
dtype: int64
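Once a Series has custom labels, values can be read either by label or by position. The following is a minimal sketch using the same values as above; the variable name `scores` is an assumption made for this example.

```python
import pandas as pd

scores = pd.Series([1, 2, 3, 4, 5],
                   index=["one", "two", "three", "four", "five"])

# .loc looks up by index label, .iloc by integer position.
by_label = scores.loc["three"]
by_position = scores.iloc[2]

# Label slices include the end label; position slices exclude the end.
label_slice = scores.loc["two":"four"]
position_slice = scores.iloc[1:3]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

The differing slice semantics are a common source of off-by-one surprises, so it helps to be explicit about which accessor is in use.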


In [None]:
# DataFrame
# A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
# We can create a DataFrame from a dictionary of lists or arrays: each key becomes a column name.
data_dict = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
data_frame = pd.DataFrame(data_dict)
print("DataFrame:")
# End the cell with the bare variable name: the notebook then renders the DataFrame as a table,
# whereas print() would produce plain text.
data_frame

DataFrame:


,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


In [None]:
# Create a DataFrame from a list of dictionaries: each dictionary becomes one row.
data_list_of_dicts = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
data_frame_from_list = pd.DataFrame(data_list_of_dicts)
print("DataFrame from list of dictionaries:")
# End the cell with the bare variable name so the notebook renders a table instead of print() text.
data_frame_from_list

DataFrame from list of dictionaries:


,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago
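The two constructions above should produce the same table. A small sketch can confirm this, and also show what happens when one dictionary lacks a key. The names `from_columns`, `from_rows`, and `ragged` are assumptions introduced for this example.

```python
import pandas as pd

# Column-oriented input: dictionary of lists.
from_columns = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Row-oriented input: list of dictionaries.
from_rows = pd.DataFrame([
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "Los Angeles"},
    {"Name": "Charlie", "Age": 35, "City": "Chicago"},
])

# equals() compares shape, labels, dtypes, and values.
print(from_columns.equals(from_rows))

# If a dictionary is missing a key, that cell is filled with NaN.
ragged = pd.DataFrame([{"Name": "Alice", "Age": 25}, {"Name": "Bob"}])
print(ragged)
```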


In [None]:
# Load a CSV file into a DataFrame; the relative path is resolved against the working directory.
df = pd.read_csv('sales_data.csv')
df.head(5)  # Display the first 5 rows of the DataFrame
# Avoid print(df.head(5)) here: the notebook's table rendering is easier to read than print() output.

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal
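If `sales_data.csv` is not at hand, `read_csv` can be tried on in-memory text instead. The sketch below copies the first three rows shown above into a string; wrapping it in `io.StringIO` is an assumption made here so the example is self-contained, not something the notebook itself does.

```python
import io
import pandas as pd

# Three rows copied from the output above, as CSV text.
csv_text = """Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.head(2))  # head(n) returns the first n rows
```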


In [22]:
df.tail(5)  # Display the last 5 rows of the DataFrame

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
235,10236,2024-08-23,Home Appliances,Nespresso Vertuo Next Coffee and Espresso Maker,1,159.99,159.99,Europe,PayPal
236,10237,2024-08-24,Clothing,Nike Air Force 1 Sneakers,3,90.0,270.0,Asia,Debit Card
237,10238,2024-08-25,Books,The Handmaid's Tale by Margaret Atwood,3,10.99,32.97,North America,Credit Card
238,10239,2024-08-26,Beauty Products,Sunday Riley Luna Sleeping Night Oil,1,55.0,55.0,Europe,PayPal
239,10240,2024-08-27,Sports,Yeti Rambler 20 oz Tumbler,2,29.99,59.98,Asia,Credit Card


In [27]:
# Accessing Data from DataFrame
data_frame

,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


In [None]:
data_frame['Name']  # Access a single column; the result is a Series

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

In [29]:
data_frame.loc[0]  # Access the row whose index label is 0 (here, the first row) as a Series

Name       Alice
Age           25
City    New York
Name: 0, dtype: object

In [30]:
data_frame.iloc[0]  # Access the first row as a Series by integer position, regardless of its label

Name       Alice
Age           25
City    New York
Name: 0, dtype: object
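With the default index, `.loc[0]` and `.iloc[0]` return the same row, which can hide the difference between them. The sketch below uses a non-default index to separate the two; the index values 10, 20, 30 and the name `people` are assumptions chosen for illustration.

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],  # labels no longer match positions
)

# .iloc selects by position: position 0 is the first row.
first_by_position = people.iloc[0]

# .loc selects by label: label 20 is the second row.
row_by_label = people.loc[20]

# There is no row labeled 0, so .loc[0] raises KeyError.
try:
    people.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position["Name"], row_by_label["Name"], label_zero_found)
```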

In [None]:
data_frame # displaying the DataFrame again to see the output

,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


In [49]:
# Accessing a single element by row label and column label
data_frame.at[0, 'Name']  # Access the 'Name' of the first row

'Alice'

In [50]:
# Accessing a single element by integer row and column position
data_frame.iat[1, 1]  # Row position 1, column position 1: the 'Age' of the second row

np.int64(30)
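The result above is a NumPy scalar rather than a plain Python `int`. A short sketch of converting it, and of using `.at` to assign a single cell, follows; the frame `people` and the new value 31 are assumptions for this example.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# .iat returns a NumPy integer scalar for an integer column.
age = people.iat[1, 1]

# Convert to a built-in int when a plain Python value is needed.
age_as_int = int(age)

# .at can also write a single cell, addressed by row and column label.
people.at[1, "Age"] = 31

print(type(age).__name__, age_as_int)
print(people)
```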

# Data Manipulation with DataFrame

In [51]:
# Starting point: the DataFrame before any changes
data_frame

,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


In [58]:
data_frame['Salary'] = [50000, 60000, 70000]  # Adding a new column 'Salary'
data_frame

,Name,Age,City,Salary
0,Alice,25,New York,50000
1,Bob,30,Los Angeles,60000
2,Charlie,35,Chicago,70000


In [59]:
# Remove a column from the DataFrame
data_frame.drop('Salary', axis=1, inplace=True)  # axis=1 targets columns; the default axis=0 targets rows
# drop() returns a new DataFrame by default; inplace=True modifies data_frame itself instead
data_frame  # Display the DataFrame after removing the 'Salary' column

,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


In [60]:
# Add 5 to every value in the 'Age' column
data_frame['Age'] = data_frame['Age'] + 5  # Incrementing the 'Age' column by 5
data_frame

,Name,Age,City
0,Alice,30,New York
1,Bob,35,Los Angeles
2,Charlie,40,Chicago


In [None]:
# Remove the row with index label 0 (the first row); with the default axis=0, drop() targets rows
data_frame.drop(0, inplace=True)
data_frame  # Display the DataFrame after removing the first row

,Name,Age,City
1,Bob,35,Los Angeles
2,Charlie,40,Chicago


In [62]:
df = pd.read_csv('sales_data.csv')
df.head(5)  # Display the first 5 rows of the DataFrame

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal


In [66]:
# Display the data types of each column in the DataFrame
print("Data types of each column:\n", df.dtypes)

# Describe the DataFrame to get statistical summary
print("\nStatistical summary of the DataFrame:")
df.describe()

Data types of each column:
 Transaction ID        int64
Date                 object
Product Category     object
Product Name         object
Units Sold            int64
Unit Price          float64
Total Revenue       float64
Region               object
Payment Method       object
dtype: object

Statistical summary of the DataFrame:


,Transaction ID,Units Sold,Unit Price,Total Revenue
count,240.0,240.0,240.0,240.0
mean,10120.5,2.158333,236.395583,335.699375
std,69.42622,1.322454,429.446695,485.804469
min,10001.0,1.0,6.5,6.5
25%,10060.75,1.0,29.5,62.965
50%,10120.5,2.0,89.99,179.97
75%,10180.25,3.0,249.99,399.225
max,10240.0,10.0,3899.99,3899.99
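Two things stand out in the output above: `Date` was read as text rather than as a date type, and `describe()` summarized only the numeric columns. The sketch below shows one way to handle both on a small hand-copied sample of three rows from the data; it is an illustration under that assumption, not a summary of the full file.

```python
import pandas as pd

# Three rows copied from the head() output shown earlier.
sales = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-05"],
    "Region": ["North America", "Europe", "Europe"],
    "Total Revenue": [1999.98, 499.99, 89.99],
})

# Convert the text column to a datetime column.
sales["Date"] = pd.to_datetime(sales["Date"])
print(sales.dtypes)

# The .dt accessor exposes date parts such as the month number.
month_numbers = sales["Date"].dt.month

# value_counts() gives a simple summary for a text column.
region_counts = sales["Region"].value_counts()

print(month_numbers.tolist())
print(region_counts)
```

When loading from a file, passing `parse_dates=['Date']` to `pd.read_csv` is another way to get a datetime column directly.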
