# Pandas Basics: Data Analysis and Manipulation

Pandas is a fast, flexible open-source library for data analysis and manipulation, built on top of NumPy. It provides the data structures and functions needed to work efficiently with structured data, such as tables and time series.

## Key Features of Pandas:
1. Provides **Series** and **DataFrame**, powerful data structures for efficient data manipulation.
2. Supports **data alignment**, **handling missing data**, and **label-based indexing**.
3. Offers built-in **grouping, merging, and reshaping capabilities**.
4. Provides **high-performance** tools for reading and writing data in many formats (CSV, Excel, SQL databases, etc.).
5. Enables **time-series functionality** and **data cleaning tools**.


## Working with Pandas Series

A **Series** is a one-dimensional labeled array that can hold any data type.


In [5]:
import pandas as pd
import numpy as np

In [12]:
# Creating a Pandas Series
data = [10, 20, 30, 40, 50]
index_labels = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index_labels)

# Display the series
print(series)

a    10
b    20
c    30
d    40
e    50
dtype: int64


Creating a simple pandas Series without specifying an index (a default integer index is used)

In [7]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

Accessing the array representation and index of the Series

In [11]:
print(obj.array)  # Returns a NumpyExtensionArray (named PandasArray in older pandas versions) wrapping a NumPy array
print(obj.index)  # Default RangeIndex from 0 to N-1


<NumpyExtensionArray>
[np.int64(4), np.int64(7), np.int64(-5), np.int64(3)]
Length: 4, dtype: int64
RangeIndex(start=0, stop=4, step=1)


Creating a Series with a custom index

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(obj2)

Accessing the index of the Series


In [None]:
print(obj2.index)

Accessing values using labels

In [None]:

print(obj2["a"])  # Returns -5
obj2["d"] = 6  # Modifies value at index "d"
print(obj2[["c", "a", "d"]])  # Selecting multiple values using index labels


Checking membership in Series index

In [None]:
print("b" in obj2)  # Returns True
print("e" in obj2)  # Returns False
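Note that `in` tests the index labels of a Series, not its values. The following minimal sketch contrasts the two; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

label_found = "b" in s            # "b" is an index label
value_as_label = 7 in s           # 7 is a value, not a label
value_found = 7 in s.to_numpy()   # search the stored values instead
mask = s.isin([7, 3])             # element-wise membership test on the values

print(label_found, value_as_label, value_found)
print(mask)
```

To test membership among the values, search `s.to_numpy()` or use `isin()`, which returns a boolean Series.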


Creating a Series from a Python dictionary

In [None]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
print(obj3)


Converting a Series back to a dictionary

In [None]:
 
print(obj3.to_dict())

Creating a Series with a custom index that is different from dictionary keys

In [None]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
print(obj4)  # "California" will have NaN since it's not in sdata

Checking for missing values

In [None]:

print(pd.isna(obj4))  # Returns a boolean Series
print(pd.notna(obj4))  # Opposite of isna()


Using instance methods to check missing values

In [None]:

print(obj4.isna())  # Returns a boolean Series
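The boolean Series returned by `isna()` and `notna()` can be used directly for counting and filtering. This is a small sketch that rebuilds `obj4` so it runs on its own; `n_missing` and `present` are names chosen for this example.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

# True counts as 1, so summing the mask counts the missing entries
n_missing = int(obj4.isna().sum())

# Boolean indexing keeps only the rows where the mask is True
present = obj4[obj4.notna()]

print("Missing entries:", n_missing)
print(present)
```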

Performing arithmetic operations while aligning indexes

In [None]:

print(obj3 + obj4)  # Missing values propagate (NaN)
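To see the alignment at work, the sketch below rebuilds `obj3` and `obj4` and compares the `+` operator with the `add()` method and its `fill_value` parameter. Using `fill_value=0` is just one possible choice for this example.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Labels are matched by name; a label missing on either side yields NaN
total = obj3 + obj4

# fill_value substitutes 0 when a label is missing on exactly one side
filled = obj3.add(obj4, fill_value=0)

print(total)
print(filled)
```

"Utah" exists only in `obj3`, so it is NaN in `total` but keeps its value in `filled`. "California" has no usable value on either side (absent from `obj3`, NaN in `obj4`), so it stays NaN in both results.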

Assigning names to Series and index

In [None]:

obj4.name = "population"
obj4.index.name = "state"
print(obj4)

Modifying the index of a Series

In [None]:

obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
print(obj)

In [None]:
import pandas as pd

# ===============================================
# Part 1: Creating a Pandas Series
# ===============================================

# Step 1: Create a simple Pandas Series with the following values: [5, 10, 15, 20, 25]
#         and assign it to a variable called `s1`.

# Hint: Use the `pd.Series()` function. You can pass a list of values to it.

s1 = pd.Series([5, 10, 15, 20, 25])
print("Part 1 - Simple Series:")
print(s1)
print("\n")

# Step 2: Create another Pandas Series with custom indexes. 
#         Use the following data: [50, 100, 150, 200]
#         Assign it to a variable called `s2`, and use indexes ['a', 'b', 'c', 'd'].

# Hint: Provide a list of index values using the `index` parameter in `pd.Series()`.

s2 = pd.Series([50, 100, 150, 200], index=['a', 'b', 'c', 'd'])
print("Part 2 - Series with Custom Indexes:")
print(s2)
print("\n")

# ===============================================
# Part 2: Accessing and Manipulating Data in Series
# ===============================================

# Step 3: Access the value at index 'c' in the `s2` Series.

# Hint: You can access values using the index like `s2['c']`.

value_c = s2['c']
print("Part 3 - Accessing value at index 'c' in s2:")
print(value_c)
print("\n")

# Step 4: Add `s1` and `s2` element-wise. 
#         Make sure that the indexes in both Series match before performing the operation.

# Hint: `s1` has integer labels 0-4, so `reindex()` with letter labels would match nothing
#       and return only NaN. Relabel `s1` by position with `set_axis()` instead.

# Relabel s1 so that it shares the labels 'a'-'d' with s2.
s1_relabeled = s1.set_axis(['a', 'b', 'c', 'd', 'e'])
print("Part 4 - Adding s1 and s2 element-wise after relabeling:")
print(s1_relabeled + s2)  # 'e' is NaN because s2 has no label 'e'
print("\n")

# Step 5: Multiply `s1` by 3.

# Hint: You can multiply a Series by a scalar directly (like `s1 * 3`).

s1_multiplied = s1 * 3
print("Part 5 - Multiply s1 by 3:")
print(s1_multiplied)
print("\n")

# Step 6: Keep only the values in `s1` that are greater than 15.

# Hint: You can filter values in a Series by applying a condition like `s1[s1 > 15]`.

filtered_values = s1[s1 > 15]
print("Part 6 - Filter values in s1 greater than 15:")
print(filtered_values)
print("\n")

# ===============================================
# Part 3: Additional Task
# ===============================================

# Step 7: Create a Series where the indexes are ['one', 'two', 'three', 'four', 'five']
#         and the values are the squares of the numbers 1, 2, 3, 4, 5. 
#         Assign it to a variable called `squares_series`.

# Hint: You can manually create the squares of numbers from 1 to 5 or use a loop/list comprehension.

squares_series = pd.Series([1, 4, 9, 16, 25], index=['one', 'two', 'three', 'four', 'five'])
print("Part 7 - Series of squares from 1 to 5:")
print(squares_series)
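One point worth checking in Step 4: `reindex()` selects data by existing labels, whereas `set_axis()` replaces the labels and keeps the data in position. The sketch below compares the two on the same `s1` and `s2`; the names `reindexed`, `relabeled`, and `total` are chosen for this example.

```python
import pandas as pd

s1 = pd.Series([5, 10, 15, 20, 25])                               # labels 0..4
s2 = pd.Series([50, 100, 150, 200], index=["a", "b", "c", "d"])

letters = ["a", "b", "c", "d", "e"]

# reindex looks up each requested label; none exist in s1, so all are NaN
reindexed = s1.reindex(letters)

# set_axis keeps the values and swaps in the new labels
relabeled = s1.set_axis(letters)

total = relabeled + s2   # 'e' has no partner in s2 and becomes NaN

print(reindexed)
print(total)
```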

## Working with Pandas DataFrame

A **DataFrame** is a two-dimensional table with labeled axes (rows and columns). It is the most commonly used pandas structure for handling tabular data.


In [None]:
# Creating a Pandas DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}

df = pd.DataFrame(data)

# Display the DataFrame
print(df)

Creating a DataFrame from a dictionary of lists.
Each key in the dictionary becomes a column name, and the values are lists of equal length.

In [None]:

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

In [None]:
frame = pd.DataFrame(data)  # Creating the DataFrame
print(frame)  # Displaying the DataFrame

Display the first 5 rows using head()

In [None]:

print("First 5 rows:")
print(frame.head())


Display the last 5 rows using tail()

In [None]:

print("Last 5 rows:")
print(frame.tail())
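Both `head()` and `tail()` default to 5 rows and accept an optional row count. A short sketch, using a reduced version of the data above:

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
})

first_two = frame.head(2)    # first 2 rows
last_three = frame.tail(3)   # last 3 rows
default_head = frame.head()  # 5 of the 6 rows

print(first_two)
print(last_three)
```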

Specifying column order when creating a DataFrame

In [None]:

frame_reordered = pd.DataFrame(data, columns=["year", "state", "pop"])
print("DataFrame with reordered columns:")
print(frame_reordered)


Adding a new column 'debt' that does not exist in the original data

In [None]:
 
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
print("DataFrame with new 'debt' column (NaN values):")
print(frame2)

Displaying column names

In [None]:
 
print("Column names:")
print(frame2.columns)

Accessing a single column using dictionary-like notation

In [None]:

print("State column:")
print(frame2["state"])  # Returns a Series



Accessing a single column using dot notation

In [None]:

print("Year column:")
print(frame2.year)  # Works if column name is a valid Python identifier
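Dot notation also fails when a column name collides with an existing DataFrame attribute or method. The `pop` column used in this section is such a case, because `DataFrame.pop` is a method. A minimal sketch with a two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Nevada"],
    "year": [2000, 2001],
    "pop": [1.5, 2.4],
})

year_col = frame.year    # works: returns the 'year' column as a Series
pop_attr = frame.pop     # returns the DataFrame.pop method, not the column
pop_col = frame["pop"]   # bracket notation always returns the column

print(type(year_col).__name__)
print(callable(pop_attr))
print(pop_col)
```

For this reason bracket notation is the safer default, and it is also required when creating a new column.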



Accessing a row using loc (by index label)

In [None]:

print("Row at index 1:")
print(frame2.loc[1])

Accessing a row using iloc (by integer position)

In [None]:

print("Row at position 2:")
print(frame2.iloc[2])
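With the default index, labels and positions coincide, so `loc` and `iloc` appear to behave the same. The difference shows up once the index labels are not 0, 1, 2, and so on. The sketch below uses a hypothetical index of `[10, 20, 30]` to illustrate this.

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001, 2002]}, index=[10, 20, 30])

by_label = df.loc[20]      # the row whose index label is 20
by_position = df.iloc[2]   # the third row, whatever its label

print(by_label)
print(by_position)
```

Here `df.loc[2]` would raise a `KeyError`, because no row carries the label 2.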



Modifying an entire column by assigning a scalar value

In [None]:

frame2["debt"] = 16.5
print("DataFrame after assigning a scalar value to 'debt' column:")
print(frame2)



Assigning an array to a column

In [None]:

frame2["debt"] = np.arange(6.)
print("DataFrame after assigning an array to 'debt' column:")
print(frame2)



Assigning a Series to a column (index alignment happens)

In [None]:

val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val  # None of val's labels exist in frame2's integer index (0-5), so every row gets NaN
print("DataFrame after assigning a Series to 'debt' column:")
print(frame2)
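For comparison, here is a sketch in which some of the Series labels do match the DataFrame's index. The labels `[2, 4, 5]` are an assumption chosen for this example; values are placed in the matching rows and the remaining rows get NaN.

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
})

# Labels 2, 4 and 5 exist in frame's default index 0..5
val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])
frame["debt"] = val

print(frame)
```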

Adding a new column based on a condition

In [None]:

frame2["eastern"] = frame2["state"] == "Ohio"
print("DataFrame after adding 'eastern' column:")
print(frame2)



Deleting a column using 'del'

In [None]:

del frame2["eastern"]
print("DataFrame after deleting 'eastern' column:")
print(frame2)

Creating a nested dictionary

In [None]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

Converting nested dictionary to DataFrame

In [None]:
frame3 = pd.DataFrame(populations)
print(frame3)

Transposing the DataFrame (rows become columns and vice versa)

In [None]:
print(frame3.T)

Creating DataFrame with explicit index

In [None]:
print(pd.DataFrame(populations, index=[2001, 2002, 2003]))
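When an explicit `index` is passed together with a nested dictionary, only the listed labels are kept, and labels with no data are filled with NaN. A small sketch that makes both effects visible (the name `explicit` is chosen for this example):

```python
import pandas as pd

populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

explicit = pd.DataFrame(populations, index=[2001, 2002, 2003])

print(explicit)
print(2000 in explicit.index)        # 2000 was not requested, so it is dropped
print(explicit.loc[2003].isna())     # 2003 has no data in either state
```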

Creating a dictionary of Series from the existing DataFrame

In [None]:
pdata = {"Ohio": frame3["Ohio"][:-1], "Nevada": frame3["Nevada"][:2]}
print(pd.DataFrame(pdata))


Setting index and column names

In [None]:
frame3.index.name = "year"
frame3.columns.name = "state"
print(frame3)

Converting DataFrame to NumPy array

In [None]:
print(frame3.to_numpy())

Retrieving the Index object of a Series

In [None]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
print(index)

Index slicing

In [None]:
print(index[1:])

Index objects are immutable

In [None]:

index[1] = "d"  # This would raise a TypeError
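Because the assignment above stops the notebook with an exception, the following sketch catches the error and shows one way to obtain a modified copy instead. Combining `delete()` and `insert()` is just one option; both return a new Index and leave the original unchanged.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

message = None
try:
    index[1] = "d"               # item assignment is not supported
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# Build a new Index with the label at position 1 replaced
new_index = index.delete(1).insert(1, "d")

print(index)
print(new_index)
```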

Creating an Index object explicitly

In [None]:
labels = pd.Index(np.arange(3))
print(labels)

Using Index object in a Series

In [None]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)

Checking if the same Index object is used

In [None]:

print(obj2.index is labels)

Checking for membership in an Index

In [None]:
print("Ohio" in frame3.columns)
print(2003 in frame3.index)

Creating an Index with duplicate labels

In [None]:
print(pd.Index(["foo", "foo", "bar", "bar"]))

Using Index methods

In [None]:
index1 = pd.Index(["a", "b", "c"])
index2 = pd.Index(["c", "d", "e"])
print(index1.append(index2))  # Concatenating indexes
print(index1.difference(index2))  # Set difference
print(index1.intersection(index2))  # Set intersection
print(index1.union(index2))  # Set union
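The difference between `append()` and the set operations is easiest to see from the resulting labels. The sketch below converts each result to a list; the variable names are chosen for this example.

```python
import pandas as pd

index1 = pd.Index(["a", "b", "c"])
index2 = pd.Index(["c", "d", "e"])

appended = index1.append(index2).tolist()       # keeps the duplicate "c"
only_in_1 = index1.difference(index2).tolist()  # labels in index1 but not index2
common = index1.intersection(index2).tolist()   # labels present in both
combined = index1.union(index2).tolist()        # every label once

print(appended)
print(only_in_1)
print(common)
print(combined)
```

`append()` simply concatenates, so it can produce duplicate labels, while `union()` returns each label only once.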

Checking properties of an Index

In [None]:
monotonic_index = pd.Index([1, 2, 3])
print(monotonic_index.is_monotonic_increasing)  # True if values never decrease (is_monotonic was removed in pandas 2.0)
print(monotonic_index.is_unique)  # True if no duplicates



Getting unique values from an Index

In [None]:

print(pd.Index([1, 2, 2, 3]).unique())


In [None]:
# Exercise: Pandas DataFrame Practice

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

# Part 1: Create DataFrame
# 1. Create a DataFrame `df1` from the dictionary `data` defined above.
#   Assign the resulting DataFrame to `df1` and print it.
#   Hint: Use pd.DataFrame(data)

# TODO: Write your code here

# Part 2: Indexing and Accessing Columns
# 2. Access the "Age" column of the DataFrame `df1` and print it.
#    Hint: You can use df1['Age'] to access the 'Age' column.

# TODO: Write your code here

# 3. Access the first row of `df1` (i.e., the data of the person named "John") and print it.
#    Hint: Use df1.iloc[0] to access the first row.

# TODO: Write your code here

# Part 3: Adding a New Column
# 4. Add a new column `Salary` to the DataFrame `df1` with the following values:
#    [50000, 60000, 55000, 45000].
#    Print the updated DataFrame.
#    Hint: You can add a new column using df1['Salary'] = [50000, 60000, 55000, 45000]

# TODO: Write your code here

# Part 4: Modifying Data
# 5. Modify the `City` of the person named "Peter" to "Madrid". Print the updated DataFrame.
#    Hint: Use df1.loc[df1['Name'] == 'Peter', 'City'] = 'Madrid' to modify the value.

# TODO: Write your code here

# Part 5: Filtering Data
# 6. Filter the DataFrame `df1` to show only people who are older than 30. Print the filtered DataFrame.
#    Hint: Use df1[df1['Age'] > 30] for filtering the data.

# TODO: Write your code here

# Part 6: Drop a Row
# 7. Drop the row corresponding to "Anna" from the DataFrame `df1` and print the updated DataFrame.
#    Hint: Use df1 = df1[df1['Name'] != 'Anna'] to drop a row.

# TODO: Write your code here

# Part 7: Sorting the DataFrame
# 8. Sort the DataFrame `df1` by the "Age" column in descending order and print the sorted DataFrame.
#    Hint: Use df1.sort_values(by='Age', ascending=False) to sort by Age.

# TODO: Write your code here

## Data Manipulation with Pandas

Pandas provides powerful functions for filtering, sorting, and aggregating data.


In [None]:
# Filtering Data
filtered_df = df[df['Age'] > 30]

# Sorting Data
sorted_df = df.sort_values(by='Salary', ascending=False)

# Aggregation (Mean Salary)
mean_salary = df['Salary'].mean()

print("Filtered DataFrame:")
print(filtered_df)

print("\nSorted DataFrame:")
print(sorted_df)

print(f"\nMean Salary: {mean_salary}")
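These operations can also be combined. The sketch below rebuilds the same `df` and shows a filter with two conditions plus several aggregates computed at once; the thresholds and variable names are assumptions chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 60000, 70000, 80000],
})

# Combine conditions with & (and) or | (or); each needs its own parentheses
mid_career = df[(df["Age"] > 30) & (df["Salary"] < 80000)]

# agg computes several summary statistics in one call
salary_stats = df["Salary"].agg(["min", "max", "mean"])

# Sort by age, oldest first, and renumber the rows
oldest_first = df.sort_values(by="Age", ascending=False).reset_index(drop=True)

print(mid_career)
print(salary_stats)
print(oldest_first)
```

The parentheses around each condition are required because `&` binds more tightly than comparison operators in Python.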