## Pandas

Pandas is a powerful data manipulation and analysis library for Python. It provides data structures like Series and DataFrames, which are essential for handling structured data.

A Series is a one-dimensional labeled array capable of holding any data type. Like a list, it keeps its values in order; like a dictionary, it maps each label (the index) to a value.

A DataFrame, on the other hand, is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table.


### Series

In [1]:
import pandas as pd

In [6]:
data = [1, 2, 3, 4, 5]
series = pd.Series(data) ## Creating a Series from a list

print(f'Series:\n{series}')


Series:
0    1
1    2
2    3
3    4
4    5
dtype: int64


In [7]:
data = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}
series = pd.Series(data) ## Creating a Series from a dictionary

print(f'Series:\n{series}')


Series:
A    1
B    2
C    3
D    4
E    5
dtype: int64


In [8]:
data = [1, 2, 3, 4, 5]
index = ['A', 'B', 'C', 'D', 'E']
series = pd.Series(data=data, index=index) ## Creating a Series with a custom index

print(f'Series:\n{series}')


Series:
A    1
B    2
C    3
D    4
E    5
dtype: int64
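
Because a Series is labeled, its values can be read either by label or by position. The following is a minimal sketch of that idea; the variable names `by_label`, `by_position`, and `doubled` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same Series as above: values 1..5 labeled 'A'..'E'
series = pd.Series([1, 2, 3, 4, 5], index=['A', 'B', 'C', 'D', 'E'])

by_label = series['C']        # dictionary-style lookup by label
by_position = series.iloc[2]  # list-style lookup by integer position
doubled = series * 2          # arithmetic is element-wise and keeps the labels

print(by_label, by_position)
print(doubled)
```

Both lookups return the same element here because the label `'C'` sits at position 2.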


### DataFrame

In [9]:
## Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

print(f'DataFrame:\n{df}')


DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [11]:
## Creating a DataFrame from a list of dictionaries
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)

print(f'DataFrame:\n{df}')
print(type(df))


DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
<class 'pandas.core.frame.DataFrame'>


In [None]:
## Reading a DataFrame from a CSV file

df = pd.read_csv('data/sales_data.csv')
df


|     | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price | Total Revenue | Region | Payment Method |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10001 | 2024-01-01 | Electronics | iPhone 14 Pro | 2 | 999.99 | 1999.98 | North America | Credit Card |
| 1 | 10002 | 2024-01-02 | Home Appliances | Dyson V11 Vacuum | 1 | 499.99 | 499.99 | Europe | PayPal |
| 2 | 10003 | 2024-01-03 | Clothing | Levi's 501 Jeans | 3 | 69.99 | 209.97 | Asia | Debit Card |
| 3 | 10004 | 2024-01-04 | Books | The Da Vinci Code | 4 | 15.99 | 63.96 | North America | Credit Card |
| 4 | 10005 | 2024-01-05 | Beauty Products | Neutrogena Skincare Set | 1 | 89.99 | 89.99 | Europe | PayPal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 10236 | 2024-08-23 | Home Appliances | Nespresso Vertuo Next Coffee and Espresso Maker | 1 | 159.99 | 159.99 | Europe | PayPal |
| 236 | 10237 | 2024-08-24 | Clothing | Nike Air Force 1 Sneakers | 3 | 90.00 | 270.00 | Asia | Debit Card |
| 237 | 10238 | 2024-08-25 | Books | The Handmaid's Tale by Margaret Atwood | 3 | 10.99 | 32.97 | North America | Credit Card |
| 238 | 10239 | 2024-08-26 | Beauty Products | Sunday Riley Luna Sleeping Night Oil | 1 | 55.00 | 55.00 | Europe | PayPal |
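
`pd.read_csv` also accepts a file-like object, which makes it possible to try the call without having `data/sales_data.csv` on disk. The sketch below assumes a hypothetical two-row sample (`csv_text`) that uses a subset of the column names shown above; it is not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample with a subset of the sales_data.csv columns
csv_text = """Transaction ID,Date,Product Category,Units Sold,Unit Price
10001,2024-01-01,Electronics,2,999.99
10002,2024-01-02,Home Appliances,1,499.99
"""

# StringIO wraps the text so read_csv can treat it like an open file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

The first line of the text becomes the column names, and pandas infers a dtype for each column from the values.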


### Accessing Data in a DataFrame

| Method      | Usage                                    | Example            |
| ----------- | ---------------------------------------- | ------------------ |
| **`.loc`**  | By label (row & column names)            | `df.loc[3, "Age"]` |
| **`.iloc`** | By integer position (row & column index) | `df.iloc[0, 1]`    |
| **`.at`**   | Single cell by label                     | `df.at[3, "Age"]`  |
| **`.iat`**  | Single cell by integer position          | `df.iat[0, 1]`     |

`loc` → When you want readability and to use row/column names.

`iloc` → When you want to access by row/column index positions.

`at` → When you need a single value by label, and you want it fast.

`iat` → When you need a single value by position, and you want it fast.

For a single lookup you won't notice much difference, but when you read or write many individual cells (for example, inside a loop), `.at` and `.iat` are noticeably faster because they only handle scalar access.
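
One way to check the speed difference on your own machine is to time repeated scalar lookups with the standard-library `timeit` module. This is only a rough sketch: `df_demo` is an illustrative frame, and the measured times depend on hardware and pandas version, so no specific numbers are claimed.

```python
import timeit
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# Both accessors return the same scalar
value_loc = df_demo.loc[1, 'Age']
value_at = df_demo.at[1, 'Age']

# Time 10,000 repeated lookups with each accessor
t_loc = timeit.timeit(lambda: df_demo.loc[1, 'Age'], number=10_000)
t_at = timeit.timeit(lambda: df_demo.at[1, 'Age'], number=10_000)

print(f'.loc: {t_loc:.4f}s  .at: {t_at:.4f}s')
```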

In [None]:

data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
df

|     | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |


In [18]:
df['Name']

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

In [19]:
type(df['Name'])

pandas.core.series.Series

In [None]:
df.loc[0]  # Access the first row

Name       Alice
Age           25
City    New York
Name: 0, dtype: object

In [30]:
df.loc[1, 'City']  # Access the value in the second row and 'City' column  

'Los Angeles'

In [31]:
df.iloc[2]  # Access the third row

Name    Charlie
Age          35
City    Chicago
Name: 2, dtype: object

In [32]:
df.iloc[0, 1]  # Access the value in the first row and second column

np.int64(25)

In [26]:
df.iloc[1, 2]  # Access the value in the second row and third column

'Los Angeles'

In [33]:
df.at[1, 'Age']  # Access the value in the second row and 'Age' column

np.int64(30)

In [34]:
df.iat[1, 1]  # Access the value in the second row and second column

np.int64(30)
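
The same accessors also take slices, and here `.loc` and `.iloc` behave differently: a label slice includes its end label, while a position slice excludes its end position. A minimal sketch, using an illustrative copy of the frame named `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# .loc slices by label and includes the end label:
# rows 0 and 1, columns 'Name' through 'Age'
by_label = df_demo.loc[0:1, 'Name':'Age']

# .iloc slices by position and excludes the end position:
# only row 0, columns at positions 0 and 1
by_position = df_demo.iloc[0:1, 0:2]

print(by_label)
print(by_position)
```

Although both calls use `0:1` for the rows, `.loc` returns two rows and `.iloc` returns one.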

### Data Manipulation with DataFrames

In [35]:
df

|     | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |


In [36]:
## Adding a New Column

df['Country'] = ['USA', 'UK', 'Canada']
df

|     | Name | Age | City | Country |
| --- | --- | --- | --- | --- |
| 0 | Alice | 25 | New York | USA |
| 1 | Bob | 30 | Los Angeles | UK |
| 2 | Charlie | 35 | Chicago | Canada |


In [37]:
## Removing a Column

## axis=1 drops a column (axis=0 would drop a row); inplace=True modifies df directly instead of returning a new DataFrame
df.drop('Country', axis=1, inplace=True)
df

|     | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |
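
Without `inplace=True`, `drop` leaves the original DataFrame untouched and returns a new one. The sketch below shows that variant on an illustrative frame `df_demo`, using the `columns=` keyword as an alternative to `axis=1`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Country': ['USA', 'UK', 'Canada']
})

# drop() returns a new DataFrame by default; columns= is equivalent to axis=1
without_country = df_demo.drop(columns=['Country'])

print(list(without_country.columns))  # the returned copy has no 'Country'
print(list(df_demo.columns))          # the original still has it
```

Assigning the result to a new name keeps both versions available, which can make a chain of transformations easier to follow.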


In [38]:
## Adding a New Row

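## Assigning to a label that does not exist yet appends a row.
## Values must follow the column order (Name, Age, City), so 'USA' is stored in the City column.
df.loc[3] = ['David', 40, 'USA']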
df

|     | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |
| 3 | David | 40 | USA |
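
Assigning through `.loc` works well for a single row. When several rows arrive at once, one possible alternative is `pd.concat`, sketched here with illustrative names (`df_demo`, `new_rows`, `combined`) and the same values as the cell above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Rows to append, keyed by column name so the order does not matter
new_rows = pd.DataFrame([{'Name': 'David', 'Age': 40, 'City': 'USA'}])

# ignore_index=True renumbers the combined rows 0..n-1
combined = pd.concat([df_demo, new_rows], ignore_index=True)

print(combined)
```

Unlike the `.loc` assignment, `pd.concat` returns a new DataFrame and leaves `df_demo` unchanged.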


In [39]:
## Updating a Column: add 5 to every value in 'Age'
df['Age'] = df['Age'] + 5
df

|     | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Alice | 30 | New York |
| 1 | Bob | 35 | Los Angeles |
| 2 | Charlie | 40 | Chicago |
| 3 | David | 45 | USA |


In [40]:
## Remove a Row
df.drop(3, axis=0, inplace=True)
df

|     | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Alice | 30 | New York |
| 1 | Bob | 35 | Los Angeles |
| 2 | Charlie | 40 | Chicago |


In [41]:
df = pd.read_csv('data/sales_data.csv')
df.head()

|     | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price | Total Revenue | Region | Payment Method |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10001 | 2024-01-01 | Electronics | iPhone 14 Pro | 2 | 999.99 | 1999.98 | North America | Credit Card |
| 1 | 10002 | 2024-01-02 | Home Appliances | Dyson V11 Vacuum | 1 | 499.99 | 499.99 | Europe | PayPal |
| 2 | 10003 | 2024-01-03 | Clothing | Levi's 501 Jeans | 3 | 69.99 | 209.97 | Asia | Debit Card |
| 3 | 10004 | 2024-01-04 | Books | The Da Vinci Code | 4 | 15.99 | 63.96 | North America | Credit Card |
| 4 | 10005 | 2024-01-05 | Beauty Products | Neutrogena Skincare Set | 1 | 89.99 | 89.99 | Europe | PayPal |


In [42]:
## Display data types
print(f'Data Types:\n{df.dtypes}')

## Describe the DataFrame
print(f'DataFrame Description:\n{df.describe()}')

Data Types:
Transaction ID        int64
Date                 object
Product Category     object
Product Name         object
Units Sold            int64
Unit Price          float64
Total Revenue       float64
Region               object
Payment Method       object
dtype: object
DataFrame Description:
       Transaction ID  Units Sold   Unit Price  Total Revenue
count       240.00000  240.000000   240.000000     240.000000
mean      10120.50000    2.158333   236.395583     335.699375
std          69.42622    1.322454   429.446695     485.804469
min       10001.00000    1.000000     6.500000       6.500000
25%       10060.75000    1.000000    29.500000      62.965000
50%       10120.50000    2.000000    89.990000     179.970000
75%       10180.25000    3.000000   249.990000     399.225000
max       10240.00000   10.000000  3899.990000    3899.990000


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    240 non-null    int64  
 1   Date              240 non-null    object 
 2   Product Category  240 non-null    object 
 3   Product Name      240 non-null    object 
 4   Units Sold        240 non-null    int64  
 5   Unit Price        240 non-null    float64
 6   Total Revenue     240 non-null    float64
 7   Region            240 non-null    object 
 8   Payment Method    240 non-null    object 
dtypes: float64(2), int64(2), object(5)
memory usage: 17.0+ KB
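
The `info()` output lists `Date` as `object`, meaning the dates were read as plain strings. If date arithmetic or filtering is needed, one option is `pd.to_datetime`. The sketch below uses a hypothetical three-row `sample` frame in the same `YYYY-MM-DD` format rather than the real file.

```python
import pandas as pd

# Hypothetical sample using the same date format as the Date column
sample = pd.DataFrame({'Date': ['2024-01-01', '2024-01-02', '2024-08-26']})

# Parse the strings into datetime values; format= makes the expected layout explicit
sample['Date'] = pd.to_datetime(sample['Date'], format='%Y-%m-%d')

print(sample.dtypes)
print(sample['Date'].dt.month.tolist())
```

Once the column holds datetime values, the `.dt` accessor exposes parts such as the month or the day of the week.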


In [44]:
df.describe()

|     | Transaction ID | Units Sold | Unit Price | Total Revenue |
| --- | --- | --- | --- | --- |
| count | 240.0 | 240.0 | 240.0 | 240.0 |
| mean | 10120.5 | 2.158333 | 236.395583 | 335.699375 |
| std | 69.42622 | 1.322454 | 429.446695 | 485.804469 |
| min | 10001.0 | 1.0 | 6.5 | 6.5 |
| 25% | 10060.75 | 1.0 | 29.5 | 62.965 |
| 50% | 10120.5 | 2.0 | 89.99 | 179.97 |
| 75% | 10180.25 | 3.0 | 249.99 | 399.225 |
| max | 10240.0 | 10.0 | 3899.99 | 3899.99 |
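
By default `describe()` summarizes only the numeric columns, which is why text columns such as `Region` are missing from the output above. Passing `include='all'` adds `unique`, `top`, and `freq` rows for the non-numeric columns. A minimal sketch on a hypothetical three-row `sample` frame:

```python
import pandas as pd

# Hypothetical sample with one numeric and one text column
sample = pd.DataFrame({
    'Units Sold': [1, 3, 1],
    'Region': ['Europe', 'Asia', 'Europe']
})

# include='all' adds unique/top/freq for non-numeric columns
summary = sample.describe(include='all')

print(summary)
```

Cells that do not apply to a column (for example, `mean` for `Region`) are shown as `NaN`.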
