# 📘 Pandas: Introduction to Data Structures

This notebook provides a comprehensive overview of the fundamental data structures in pandas:
- **Series**: One-dimensional labeled arrays
- **DataFrame**: Two-dimensional labeled data structures

We'll explore their creation, manipulation, and various operations you can perform with them.

## 1. Importing Required Libraries

First, let's import the necessary Python libraries:

In [None]:
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation and analysis

# Set some display options for better output readability
pd.set_option('display.max_rows', 8)
pd.set_option('display.precision', 4)

## 2. Series Data Structure

A **Series** is a one-dimensional labeled array that can hold data of any type (integers, floats, strings, Python objects, etc.). It is comparable to a single column in a spreadsheet or SQL table.

### Creating Series

In [None]:
# Create a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series from list:")
print(s)

# Create a Series with custom index
s_custom = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print("\nSeries with custom index:")
print(s_custom)

# Create a Series from a dictionary
d = {'a': 0., 'b': 1., 'c': 2.}
s_dict = pd.Series(d)
print("\nSeries from dictionary:")
print(s_dict)

# Create a Series with a scalar value
s_scalar = pd.Series(5., index=['a', 'b', 'c'])
print("\nSeries from scalar value:")
print(s_scalar)
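
When a dictionary is passed together with an explicit `index`, pandas looks up each requested label in the dictionary and fills labels it cannot find with `NaN`. The following is a minimal sketch of that behavior; the names `prices` and `s_reindexed` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
prices = {'a': 0.0, 'b': 1.0, 'c': 2.0}

# Labels come from `index`: 'd' has no entry in the dict, so it becomes NaN,
# and 'a' is left out because it was not requested
s_reindexed = pd.Series(prices, index=['b', 'c', 'd'])
print(s_reindexed)
```

This is the same alignment-by-label idea that appears later in the notebook, applied at construction time.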

### Series Operations and Attributes

Series supports various operations and has useful attributes. Let's explore some of them:

In [None]:
# Create a sample Series
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print("Original Series:")
print(s)

# Basic attributes
print("\nIndex:", s.index)
print("Values:", s.values)
print("Data type:", s.dtype)

# Basic operations
print("\nMean:", s.mean())
print("Sum:", s.sum())
print("Standard deviation:", s.std())

# Filtering
print("\nValues > 0:")
print(s[s > 0])

# Arithmetic operations
print("\nMultiply by 2:")
print(s * 2)

# Mathematical functions
print("\nAbsolute values:")
print(np.abs(s))
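
Because a Series carries labels, it can also be used somewhat like a dictionary. The sketch below assumes a small illustrative Series named `s_demo` (not part of the cells above) and shows label lookup, membership tests, and label slicing.

```python
import pandas as pd

# Illustrative Series; the name and values are assumptions for this example
s_demo = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

value_b = s_demo['b']          # label lookup, like a dict key
has_a = 'a' in s_demo          # membership tests the index labels, not the values
missing = s_demo.get('z', -1)  # .get returns the default instead of raising KeyError
subset = s_demo['a':'b']       # label slices include both endpoints

print(value_b, has_a, missing)
print(subset)
```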

## 3. DataFrame Creation Methods

A **DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table. Let's explore different ways to create DataFrames.

In [None]:
# Create DataFrame from dictionary of lists
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df1 = pd.DataFrame(data)
print("DataFrame from dictionary of lists:")
print(df1)

# Create DataFrame from dictionary of Series
data_series = {
    'A': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'B': pd.Series([4., 5., 6., 7.], index=['a', 'b', 'c', 'd'])
}
df2 = pd.DataFrame(data_series)
print("\nDataFrame from dictionary of Series:")
print(df2)

# Create DataFrame from NumPy array
array = np.random.randn(3, 4)
df3 = pd.DataFrame(array, columns=list('ABCD'))
print("\nDataFrame from NumPy array:")
print(df3)

# Create DataFrame from list of dictionaries
list_of_dicts = [
    {'a': 1, 'b': 2},
    {'a': 5, 'b': 10, 'c': 20}
]
df4 = pd.DataFrame(list_of_dicts)
print("\nDataFrame from list of dictionaries:")
print(df4)
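
The constructors above treat dictionary keys as column names. When the outer keys should become row labels instead, `pd.DataFrame.from_dict` with `orient='index'` is one option. A small sketch, using hypothetical row labels `row1` and `row2`:

```python
import pandas as pd

# Outer keys are intended as row labels; inner keys as column names
records = {
    'row1': {'a': 1, 'b': 2},
    'row2': {'a': 5, 'b': 10, 'c': 20},
}

df_rows = pd.DataFrame.from_dict(records, orient='index')
print(df_rows)  # 'row1' has no 'c' entry, so that cell is NaN
```

As with the list-of-dictionaries example, keys missing from one record are filled with `NaN`.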

## 4. DataFrame Operations

Let's explore basic operations you can perform with DataFrames:

In [None]:
# Create a sample DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
print("Original DataFrame:")
print(df)

# Basic information about DataFrame
# info() prints its summary directly and returns None, so it is not wrapped in print()
print("\nDataFrame Info:")
df.info()

# Statistical summary
print("\nStatistical Summary:")
print(df.describe())

# Adding a new column
df['E'] = df['A'] + df['B']
print("\nAfter adding column 'E':")
print(df)

# Deleting a column
del df['E']
print("\nAfter deleting column 'E':")
print(df)

# Arithmetic operations
print("\nMultiply all values by 2:")
print(df * 2)

# Transpose
print("\nTransposed DataFrame:")
print(df.T)
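
Assigning with `df['E'] = ...` and removing with `del` both modify the DataFrame in place. If the original should stay untouched, `assign` and `drop` return new objects instead. A minimal sketch with an illustrative frame `df_demo`:

```python
import pandas as pd

# Illustrative data; names are assumptions for this example
df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# assign returns a new DataFrame with the extra column; df_demo is unchanged
df_with_sum = df_demo.assign(E=df_demo['A'] + df_demo['B'])

# drop(columns=...) likewise returns a new DataFrame without the column
df_without_b = df_with_sum.drop(columns=['B'])

print(df_demo.columns.tolist())
print(df_with_sum)
print(df_without_b)
```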

## 5. Indexing and Selection

pandas provides multiple ways to select and access data in DataFrames:

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': range(1, 5),
    'B': range(10, 50, 10),
    'C': ['a', 'b', 'c', 'd']
}, index=['w', 'x', 'y', 'z'])
print("Sample DataFrame:")
print(df)

# Select a single column (returns Series)
print("\nSelect column 'A':")
print(df['A'])

# Select multiple columns
print("\nSelect columns 'A' and 'B':")
print(df[['A', 'B']])

# Label-based indexing with .loc
print("\nSelect row 'y' using .loc:")
print(df.loc['y'])

# Integer-based indexing with .iloc
print("\nSelect first row using .iloc:")
print(df.iloc[0])

# Boolean indexing
print("\nSelect rows where A > 2:")
print(df[df['A'] > 2])

# Selecting a subset of rows and columns
print("\nSelect rows 'x','y' and columns 'B','C':")
print(df.loc[['x', 'y'], ['B', 'C']])
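
One difference between `.loc` and `.iloc` that is easy to miss is how slices are bounded: label slices include the end label, while positional slices exclude the end position. The sketch below rebuilds a frame like the one above under the illustrative name `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'A': range(1, 5), 'B': range(10, 50, 10)},
    index=['w', 'x', 'y', 'z'],
)

by_label = df_demo.loc['x':'y']   # rows 'x' and 'y' (end label included)
by_position = df_demo.iloc[1:2]   # row 'x' only (end position excluded)

print(by_label)
print(by_position)
```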

## 6. Missing Data Handling

pandas primarily uses the floating-point value `NaN` (Not a Number) to represent missing numeric data (datetime columns use `NaT` instead). Let's see how to detect, drop, and fill missing values:

In [None]:
# Create a DataFrame with missing values
df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': [5, 6, np.nan, 8],
    'C': [9, 10, 11, 12]
})
print("DataFrame with missing values:")
print(df)

# Check for missing values
print("\nMissing value check:")
print(df.isna())

# Count missing values in each column
print("\nCount of missing values per column:")
print(df.isna().sum())

# Drop rows with any missing values
print("\nDrop rows with missing values:")
print(df.dropna())

# Fill missing values with a specific value
print("\nFill missing values with 0:")
print(df.fillna(0))

# Fill missing values with column mean
print("\nFill missing values with column mean:")
print(df.fillna(df.mean()))
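
Since `NaN` is a floating-point value, an integer column that contains it is stored as `float64`. One way to keep integer values alongside missing entries is the nullable `Int64` extension dtype. A brief sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])                         # integer dtype
with_nan = pd.Series([1, np.nan, 3])                # upcast to float64 because of NaN
nullable = pd.Series([1, None, 3], dtype='Int64')   # stays integer; missing shown as <NA>

print(ints.dtype, with_nan.dtype, nullable.dtype)
print(nullable)
```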

## 7. Data Alignment

One of pandas' most powerful features is automatic data alignment in computations. Let's see how this works:

In [None]:
# Create two DataFrames with different indices
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['a', 'b', 'c'])

df2 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['b', 'c', 'd'])

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Adding DataFrames: rows are matched by index label, so labels found in only one operand ('a' and 'd') become NaN
print("\ndf1 + df2 (notice automatic alignment):")
print(df1 + df2)

# Series alignment
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

print("\nSeries 1:")
print(s1)
print("\nSeries 2:")
print(s2)
print("\ns1 + s2 (notice automatic alignment):")
print(s1 + s2)
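
If `NaN` for unmatched labels is not the desired result, the method form `add` accepts a `fill_value` that stands in for the missing side before the operation is applied. A minimal sketch using the same two Series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

plain = s1 + s2                      # 'a' and 'd' are NaN
filled = s1.add(s2, fill_value=0)    # missing side treated as 0

print(plain)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; it is used here only to illustrate the parameter.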

## 8. Data Type Operations

pandas supports a variety of data types. Let's explore how to work with different data types:

In [None]:
# Create a DataFrame with different data types
df = pd.DataFrame({
    'integer': [1, 2, 3],
    'float': [1.1, 2.2, 3.3],
    'string': ['a', 'b', 'c'],
    'boolean': [True, False, True],
    'datetime': pd.date_range('2023-01-01', periods=3)
})

# Check data types
print("Data types of each column:")
print(df.dtypes)

# Convert types (astype truncates floats toward zero rather than rounding: 1.1 -> 1, 2.2 -> 2, 3.3 -> 3)
print("\nConvert float to integer:")
print(df['float'].astype('int32'))

print("\nConvert integer to string:")
print(df['integer'].astype(str))

# Check if object is of specific dtype
print("\nIs 'integer' column numeric?")
print(pd.api.types.is_numeric_dtype(df['integer']))

# Memory usage
print("\nMemory usage per column:")
print(df.memory_usage(deep=True))
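
Two conversion details are worth checking on real data: `astype` to an integer type truncates rather than rounds, and strings that are not valid numbers make a plain conversion fail. The sketch below uses made-up values and shows `round()` before casting, plus `pd.to_numeric` with `errors='coerce'` to turn unparseable entries into `NaN`.

```python
import pandas as pd

floats = pd.Series([1.1, 2.2, 3.9, -1.7])
truncated = floats.astype('int64')          # drops the fractional part
rounded = floats.round().astype('int64')    # rounds first, then casts

raw = pd.Series(['1', '2', 'oops'])         # hypothetical messy input
numbers = pd.to_numeric(raw, errors='coerce')  # 'oops' becomes NaN

print(truncated.tolist())
print(rounded.tolist())
print(numbers)
```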

## 9. Vectorized Operations

pandas operations are vectorized: an operation is applied to every element of a Series or DataFrame at once, in optimized compiled code, rather than through an explicit Python loop:

In [None]:
# Create a DataFrame
df = pd.DataFrame(np.random.randn(4, 4), columns=list('ABCD'))
print("Original DataFrame:")
print(df)

# Arithmetic operations
print("\nAdd 5 to all elements:")
print(df + 5)

# Mathematical functions
print("\nSquare root of absolute values:")
print(np.sqrt(np.abs(df)))

# Boolean operations
print("\nValues greater than 0:")
print(df > 0)

# String operations on Series
s = pd.Series(['cat', 'dog', 'rabbit'])
print("\nOriginal Series:")
print(s)
print("\nUppercase:")
print(s.str.upper())

# Apply a Python function element by element (this is not vectorized; s.str.len() is the vectorized equivalent)
print("\nApply custom function (length of string):")
print(s.apply(len))
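
To make the contrast concrete, the sketch below computes the same result with an explicit Python loop and with a vectorized expression, and compares `apply(len)` with the `.str.len()` accessor. Variable names are illustrative, and no timing claims are made here.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(5, dtype='float64'))

# Explicit loop: one Python-level operation per element
looped = pd.Series([v * 2 + 1 for v in values])

# Vectorized: a single expression over the whole Series
vectorized = values * 2 + 1

words = pd.Series(['cat', 'dog', 'rabbit'])
lengths_apply = words.apply(len)   # calls len() once per element
lengths_str = words.str.len()      # string accessor equivalent

print(looped.equals(vectorized))
print(lengths_str.tolist())
```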

## 10. DataFrame Display Options

pandas provides various options to customize how DataFrames are displayed:

In [None]:
# Create a large DataFrame
df = pd.DataFrame(np.random.randn(10, 6))

# Display with the settings currently in effect (max_rows=8 and precision=4 were set in Section 1)
print("Display with current settings:")
print(df)

# Set maximum rows to display
pd.set_option('display.max_rows', 5)
print("\nDisplay with max 5 rows:")
print(df)

# Set precision for floating point numbers
pd.set_option('display.precision', 2)
print("\nDisplay with 2 decimal places:")
print(df)

# Set maximum columns to display
pd.set_option('display.max_columns', 3)
print("\nDisplay with max 3 columns:")
print(df)

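# Reset only the options changed in this notebook, leaving unrelated settings untouched
for option in ('display.max_rows', 'display.precision', 'display.max_columns'):
    pd.reset_option(option)
print("\nReset to default display:")
print(df)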
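
When an option is only needed for a single output, `pd.option_context` applies it inside a `with` block and restores the previous value afterwards, so no manual reset is required. A minimal sketch with an illustrative frame `df_demo`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12, dtype='float64').reshape(4, 3) / 7)

before = pd.get_option('display.precision')

# Options are passed as alternating name/value pairs
with pd.option_context('display.precision', 2, 'display.max_rows', 5):
    inside = pd.get_option('display.precision')
    print(df_demo)

after = pd.get_option('display.precision')
print(before, inside, after)
```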

## Summary

In this notebook, we've covered:
1. Series creation and operations
2. DataFrame creation methods
3. Basic DataFrame operations
4. Indexing and selection techniques
5. Handling missing data
6. Data alignment
7. Data type operations
8. Vectorized operations
9. Display options

These are the fundamental building blocks for data analysis with pandas. With these concepts, you can start exploring and manipulating your data effectively.