# Pandas Tutorial
A comprehensive yet beginner-friendly tutorial on **pandas**, a popular Python library for data manipulation and analysis.

In this tutorial, we will cover:

 - Installation and import of the pandas library.
 - An introduction to Pandas Series, highlighting its similarity to NumPy arrays.
 - Creating DataFrames from various data sources.
 - Basic data inspection, selection, indexing, and filtering.
 - Modifying DataFrames and performing calculations.
 - Grouping, merging, and finally saving/loading data in different formats.

  **Note:** Remember that a pandas DataFrame can be thought of as a collection of Series objects, where each column is a Series.

In [None]:
# pandas is already included in our environment, so there is no need to install it.
# If you are not using our environment, uncomment the command below to install pandas:

#!pip install pandas



 ## 1. Installation and Import



 First, install pandas if it is not already installed, then import it into your Python environment.

In [None]:
import pandas as pd


 ### 1.1. Pandas Series: An Introduction



 A **Pandas Series** is a one-dimensional labeled array capable of holding any data type. If you're already familiar with NumPy arrays, you'll notice that a Series behaves similarly but with added flexibility through indexing (labels for each element).



 In fact, a DataFrame is essentially a collection of Series objects (each column is a Series), which means many operations applicable to arrays can also be performed on Series.

In [None]:
# Creating a Pandas Series from a list
data_series = pd.Series([10, 20, 30, 40])
display("Pandas Series:")
display(data_series)

# Demonstrate that DataFrame columns are Series
df_series_example = pd.DataFrame({
    "Numbers": data_series, 
    "Squared": data_series ** 2
})
display("\nDataFrame constructed from Series:")
display(df_series_example)


 ## 2. Creating DataFrames



 A **DataFrame** is the core data structure in pandas — think of it as a table with rows and columns. You can create a DataFrame from various sources. Below are a few common methods:

 ### 2.1. From a Dictionary of Lists



 Here, each key in the dictionary represents a column name, and the corresponding value is a list of data for that column.

In [None]:
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}

df = pd.DataFrame(data)
display(df)


 ### 2.2. From a List of Dictionaries



 In this approach, each dictionary in the list represents a row of data.

In [None]:
data_list = [
    {"Name": "Alice",   "Age": 25, "City": "New York"},
    {"Name": "Bob",     "Age": 30, "City": "Los Angeles"},
    {"Name": "Charlie", "Age": 35, "City": "Chicago"}
]
df2 = pd.DataFrame(data_list)
display(df2)
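
 One practical property of the list-of-dictionaries approach is that the rows do not need to share the same keys. The sketch below uses hypothetical records with missing keys to show how pandas fills the gaps with `NaN`.

```python
import pandas as pd

# Hypothetical records: the second row has no "City", the third has no "Age"
records = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30},
    {"Name": "Charlie", "City": "Chicago"},
]

df_records = pd.DataFrame(records)

# Missing entries appear as NaN; "Age" becomes a float column as a result
print(df_records)
print(df_records.dtypes)
```

 Note that the integer `Age` column is stored as floating point here, because `NaN` is a float value.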



 ## 3. Basic Data Inspection



 After creating or loading a DataFrame, it's important to inspect your data. Common methods include:



 - **`df.head()`**: View the first few rows.

 - **`df.tail()`**: View the last few rows.

 - **`df.shape`**: Get the number of rows and columns.

 - **`df.columns`**: List all column names.

 - **`df.info()`**: Get a summary including data types and non-null counts.

 - **`df.describe()`**: Compute basic statistics for numerical columns.

In [None]:
display("First 5 rows:")
display(df.head())       # First 5 rows (use df.head(10) for the first 10)
display("\nLast 5 rows:")
display(df.tail())       # Last 5 rows
display("\nShape of DataFrame:")
display(df.shape)        # (rows, columns)
display("\nColumn Names:")
display(df.columns)      # List of column names
display("\nDataFrame Info:")
df.info()                # Prints a summary (types, non-null counts); it returns None, so display() is not needed
display("\nStatistical Summary:")
display(df.describe())   # Basic statistics for numeric columns
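
 The inspection results can also be stored in variables and reused, rather than only displayed. The following sketch assumes a small illustrative DataFrame named `people` (not part of the tutorial data) and captures a few of these values.

```python
import pandas as pd

# Illustrative data, mirroring the structure used in this tutorial
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = people.shape

# columns is an Index object; tolist() converts it to a plain Python list
column_names = people.columns.tolist()

# Count missing values per column
missing_per_column = people.isna().sum()

# describe() on a single column returns a Series of statistics
age_stats = people["Age"].describe()

# info() prints its report directly and returns None
people.info()
print(n_rows, n_cols, column_names)
print(age_stats["mean"])
```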


 ### Knowledge Check: DataFrame Inspection



 Consider the DataFrame you just inspected. Write code to:

 1. Print the first 3 rows using an alternative to the default `.head()` call (for example, by passing an argument or by slicing).

 2. Retrieve the list of column names.

 3. Summarize the DataFrame using `.info()`.



 *Hint: Use the appropriate DataFrame methods to achieve these tasks.*

In [None]:
# Your solution here:
# 1. Print the first 3 rows.
# 2. Print the list of column names.
# 3. Display the DataFrame information.


 ## 4. Selecting and Indexing Data



 Pandas offers multiple ways to select or filter data within a DataFrame.



 ### 4.1. Dot Notation / Bracket Notation



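 - **Dot Notation**: A convenient shorthand for reading a column whose name is a valid Python identifier and does not clash with an existing DataFrame attribute or method (such as `count` or `shape`).

 - **Bracket Notation**: More flexible; it works for any column name, including names with spaces or special characters, and it is required when creating a new column.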

In [None]:
# Dot notation (for simple column names without spaces/special chars)
display("Using dot notation to access 'Age':")
display(df.Age)

# Bracket notation
display("\nUsing bracket notation to access 'Age':")
display(df["Age"])
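
 The limits of dot notation are easier to see with column names that are not simple identifiers. In the sketch below, the DataFrame `scores` and its column names are hypothetical and chosen only to trigger these cases.

```python
import pandas as pd

scores = pd.DataFrame({
    "first name": ["Alice", "Bob"],  # contains a space
    "count": [3, 5],                 # same name as the DataFrame.count method
    "Age": [25, 30],
})

# Dot notation works for a simple, non-clashing name
ages = scores.Age

# Bracket notation is needed for a name containing a space
names = scores["first name"]

# Bracket notation returns the column; dot notation returns the method instead
counts = scores["count"]
is_method = callable(scores.count)

# A list of names inside the brackets selects several columns as a DataFrame
subset = scores[["Age", "count"]]

print(names)
print(is_method)
print(subset)
```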


 ### 4.2. Row Selection with `.loc` and `.iloc`



 - **`.loc`** selects rows and columns by **label**.

 - **`.iloc`** selects rows and columns by **integer position**.

In [None]:
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 30, 35, 28],
    "City": ["NY", "LA", "Chicago", "Seattle"]
}, index=["row1", "row2", "row3", "row4"])  # custom index labels

display("Using .loc (label-based):")
display(df.loc["row2"])               # Entire row labeled 'row2'
display(df.loc["row2", "Age"])        # Specific cell (row2, Age)
display(df.loc["row1":"row3"])        # Slice rows by label (both endpoints included)
display(df.loc[:, ["Name", "City"]])  # All rows, only these columns

display("\nUsing .iloc (integer-based):")
display(df.iloc[1])                   # 2nd row (since indexing starts at 0)
display(df.iloc[1, 1])                # Cell at row index 1, col index 1
display(df.iloc[0:2])                 # Rows at positions 0 and 1 (end position excluded)
display(df.iloc[:, [0, 2]])           # All rows, columns 0 and 2
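
 A detail that often surprises newcomers is that slices behave differently in the two indexers: `.loc` includes the end label, while `.iloc` excludes the end position, as ordinary Python slicing does. A minimal sketch, using an illustrative DataFrame named `cities`:

```python
import pandas as pd

cities = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "Dave"], "Age": [25, 30, 35, 28]},
    index=["row1", "row2", "row3", "row4"],
)

# Label slice: "row3" is included
label_slice = cities.loc["row1":"row3"]

# Position slice: position 2 is excluded
position_slice = cities.iloc[0:2]

# .loc also accepts a boolean mask for rows together with a column label
names_over_26 = cities.loc[cities["Age"] > 26, "Name"]

print(len(label_slice), len(position_slice))
print(names_over_26)
```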


 ## 5. Filtering Rows



 Filtering rows lets you extract data based on specific conditions.



 ### 5.1. Boolean Masking



 Create a boolean condition that returns `True/False` for each row, then use that mask to filter the DataFrame.

In [None]:
# Show only rows where Age > 28
mask = df["Age"] > 28
older_than_28 = df[mask]
display("Rows where Age > 28:")
display(older_than_28)


 ### 5.2. Multiple Conditions



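 Combine conditions using bitwise operators. Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `>` and `==`: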

 - `&` for AND

 - `|` for OR

 - `~` for NOT

In [None]:
# People older than 25 AND living in NY
df_filtered = df[(df["Age"] > 25) & (df["City"] == "NY")]
display("Rows where Age > 25 and City is NY:")
display(df_filtered)
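
 The example above covers `&` only. The sketch below illustrates `|` and `~` on the same kind of data, plus `isin()` as a compact alternative to chaining several `==` comparisons with `|`. The DataFrame name `people` is an illustrative choice.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 30, 35, 28],
    "City": ["NY", "LA", "Chicago", "Seattle"],
})

# OR: older than 30, or living in NY
either = people[(people["Age"] > 30) | (people["City"] == "NY")]

# NOT: everyone who does not live in LA
not_la = people[~(people["City"] == "LA")]

# isin(): City is one of several values
west = people[people["City"].isin(["LA", "Seattle"])]

print(either)
print(not_la)
print(west)
```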


 Alternatively, you can use the `query()` method for more complex filtering.

 Check the official documentation for more details.

 https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#query

In [None]:
df_filtered_query = df.query("Age > 25 and City == 'NY'")
display("Rows where Age > 25 and City is NY (using query):")
display(df_filtered_query)


 ### Knowledge Check: Filtering Rows



 Using the DataFrame `df`:

 1. Create a boolean mask to filter rows where the 'Age' is between 26 and 32 (inclusive).

 2. Additionally, filter rows where the 'City' starts with either 'C' or 'N'.

 3. Print the resulting DataFrame.



 *Hint: Use string methods like `.str.startswith()` on the 'City' column along with logical operators.*

In [None]:
# Your solution here:
# For example, create your mask and apply it to df.


 ## 6. Changing Values



 You can modify DataFrame values using various methods:



 ### 6.1. Assigning with `.loc`



 Modify values by referencing labels.

In [None]:
df.loc["row1", "Age"] = 26
display("After modifying using .loc:")
display(df)


 ### 6.2. Assigning with `.iloc`



 Modify values by referencing integer positions.

In [None]:
df.iloc[0, 1] = 27
display("After modifying using .iloc:")
display(df)


 ### 6.3. Vectorized Assignments



 Apply operations across entire columns efficiently.

In [None]:
# Increase everyone's Age by 1
df["Age"] = df["Age"] + 1
display("After increasing Age by 1:")
display(df)


 ### 6.4. Using `apply()`



 Apply a function to each element of a Series, or to each column (or each row, with `axis=1`) of a DataFrame, and return the result as a new object.



In [None]:
df["Age"] = df["Age"].apply(lambda x: x*x)
display("After applying lambda function to square Age:")
display(df)
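
 The difference between `apply()` on a Series and on a DataFrame can be sketched as follows. The DataFrame `people` and its `Bonus` column are hypothetical and exist only for this illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "Bonus": [5, 10],
})

# Series.apply: the function receives one element at a time
age_group = people["Age"].apply(lambda age: "30+" if age >= 30 else "under 30")

# DataFrame.apply (default axis=0): the function receives one column at a time
col_range = people[["Age", "Bonus"]].apply(lambda col: col.max() - col.min())

# DataFrame.apply with axis=1: the function receives one row at a time
row_total = people.apply(lambda row: row["Age"] + row["Bonus"], axis=1)

# For simple arithmetic, the vectorized form gives the same result
vectorized_total = people["Age"] + people["Bonus"]

print(age_group)
print(col_range)
print(row_total)
```

 When a vectorized expression exists, as in the last line, it is usually the simpler and faster choice; `apply()` is most useful for logic that cannot be expressed that way.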

 ### 6.5. Creating New Columns



 Create a new column based on existing columns.

In [None]:
df["Age in 5 years"] = df["Age"] + 5
display("After creating new column 'Age in 5 years':")
display(df)


 Direct assignment like this can cause problems when the DataFrame you assign to was itself produced by filtering or slicing another DataFrame. Depending on the pandas version, such an assignment may raise a `SettingWithCopyWarning`, and the original DataFrame may not be updated as you expect.

 To modify the original DataFrame, assign through `df.loc[]` on the original.

 To work on an independent subset instead, copy it first with `.copy()`.

In [None]:
# For example if you modify a slice of a DataFrame, it may not behave as expected.
df_slice = df.query("Age > 25")
df_slice["Age plus 1"] = df_slice["Age"] + 1
display("After modifying a slice of the DataFrame:")
display(df_slice)


 To make the intent explicit, either write to the original DataFrame with `df.loc[]` or copy the filtered DataFrame first.

In [None]:
# Modify the original DataFrame using .loc
df_slice = df.query("Age > 25")
df.loc[df_slice.index, "Age"] = df_slice["Age"] + 1
display("After modifying the original DataFrame using .loc:")
display(df)

# Or copy the DataFrame first
df_slice = df.query("Age > 25").copy()
df_slice["Age plus 1"] = df_slice["Age"] + 1
display("After modifying a copy of the slice of the DataFrame:")
display(df_slice)
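
 The two approaches have different effects on the original data, which the following sketch tries to make visible. The names `people` and `subset` are illustrative; a boolean mask is used in place of `query()` only to keep the example short.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# An explicit copy is independent of the original DataFrame
subset = people[people["Age"] > 26].copy()
subset["Age"] = subset["Age"] + 100

# The original is unchanged by edits to the copy
ages_after_copy_edit = people["Age"].tolist()

# Assigning through .loc on the original modifies the original in place
people.loc[people["Age"] > 26, "Age"] += 1

print(subset)
print(people)
```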



 ## 7. Calculating Simple Statistics and Value Counts



 Pandas provides simple methods to compute statistics and count occurrences:



 ### 7.1. Simple Statistics



 Calculate basic statistics such as mean, maximum, and minimum.

In [None]:
display("Average Age:", df["Age"].mean())  # Average age
display("Max Age:", df["Age"].max())         # Maximum age
display("Min Age:", df["Age"].min())         # Minimum age


 ### 7.2. `value_counts()`



 Count the occurrence of unique values in a Series.

In [None]:
city_counts = df["City"].value_counts()
display("City counts:")
display(city_counts)
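
 `value_counts()` can also report proportions instead of raw counts through its `normalize` parameter. A short sketch with made-up city data:

```python
import pandas as pd

# Illustrative data with repeated values
cities = pd.Series(["NY", "LA", "NY", "Chicago", "NY", "LA"])

# Counts are sorted from most to least frequent
counts = cities.value_counts()

# normalize=True returns the share of each value instead of the count
shares = cities.value_counts(normalize=True)

# nunique() gives the number of distinct values
n_unique = cities.nunique()

print(counts)
print(shares)
print(n_unique)
```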


 ## 8. Grouping and Aggregation



 Use `.groupby()` to split data into groups based on certain criteria, apply functions to each group, and combine the results.



 In the example below, we group the DataFrame by 'City' and calculate the mean Salary for each group.

In [None]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 30, 35, 28],
    "City": ["NY", "LA", "NY", "LA"],
    "Salary": [70000, 80000, 120000, 95000]
}
df = pd.DataFrame(data)

# Group by 'City' and calculate mean Salary
grouped = df.groupby("City")["Salary"].mean()
display("Mean Salary by City:")
display(grouped)
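
 A group-by is not limited to a single statistic. As a sketch, the `agg()` method with named aggregations computes several results per group at once; the output column names (`mean_salary`, `max_age`, `headcount`) and the DataFrame name `staff` are arbitrary choices for this example.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 30, 35, 28],
    "City": ["NY", "LA", "NY", "LA"],
    "Salary": [70000, 80000, 120000, 95000],
})

# Each keyword is an output column: (input column, aggregation function)
summary = staff.groupby("City").agg(
    mean_salary=("Salary", "mean"),
    max_age=("Age", "max"),
    headcount=("Name", "count"),
)

print(summary)
```

 The result is a DataFrame indexed by `City`, with one row per group.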


 ## 9. Merging / Joining DataFrames



 Combine two DataFrames by matching values in one or more key columns, similar to a join in SQL.



 ### 9.1. The `merge()` Method



 Merge two DataFrames on a common key. With `how="inner"`, only keys that appear in both DataFrames are kept.

In [None]:
df_left = pd.DataFrame({
    "PersonID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"]
})

df_right = pd.DataFrame({
    "PersonID": [1, 2, 4],
    "City": ["NY", "LA", "Houston"]
})

merged_df = pd.merge(df_left, df_right, on="PersonID", how="inner")
display("Merged DataFrame (inner join):")
display(merged_df)
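
 The `how` parameter controls which keys survive the merge. The sketch below compares an inner, a left, and an outer join on data shaped like the example above, and uses `indicator=True` to show where each row came from. The variable names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"PersonID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"PersonID": [1, 2, 4], "City": ["NY", "LA", "Houston"]})

# Inner: only keys found in both DataFrames
inner = pd.merge(left, right, on="PersonID", how="inner")

# Left: every key from the left DataFrame; unmatched City values become NaN
left_join = pd.merge(left, right, on="PersonID", how="left")

# Outer: every key from either side; indicator=True adds a "_merge" column
outer = pd.merge(left, right, on="PersonID", how="outer", indicator=True)

print(inner)
print(left_join)
print(outer)
```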


 ### 9.2. Joins on Different Column Names



 If the key column has different names in each DataFrame, use the `left_on` and `right_on` parameters.

In [None]:
# Example: df_right has no "ID" column here, so adapt the column names to your DataFrames before uncommenting.
# pd.merge(df_left, df_right, left_on="PersonID", right_on="ID")
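
 For a runnable version of the idea, the sketch below assumes a second DataFrame whose key column is named `ID`. That column name and the DataFrame names are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({"PersonID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})

# Hypothetical table whose key column is called "ID" instead of "PersonID"
locations = pd.DataFrame({"ID": [1, 2, 4], "City": ["NY", "LA", "Houston"]})

# Match people.PersonID against locations.ID (inner join by default)
merged = pd.merge(people, locations, left_on="PersonID", right_on="ID")

# Both key columns are kept in the result, so drop the redundant one
merged = merged.drop(columns="ID")

print(merged)
```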



 ## 10. Saving and Loading Data



 Pandas allows you to easily save DataFrames to various file formats and load them back into your program. Below are examples for saving to CSV and Feather formats:



 - **CSV Format**: A widely used text-based format.

 - **Feather Format**: A fast, lightweight, language-independent binary format (requires `pyarrow`).



 Pandas supports many other formats as well, including Excel, JSON, and SQL!



In [None]:
# Save DataFrame to CSV
df.to_csv("saved_data.csv", index=False)
display("DataFrame saved to CSV file: saved_data.csv")


In [None]:
# Save DataFrame to Feather format (ensure you have pyarrow installed: pip install pyarrow)
df.to_feather("saved_data.feather")
display("DataFrame saved to Feather file: saved_data.feather")


In [None]:
# Loading the saved CSV file
df_loaded_csv = pd.read_csv("saved_data.csv")
display("CSV file loaded:")
display(df_loaded_csv)


In [None]:
# Loading the saved Feather file
df_loaded_feather = pd.read_feather("saved_data.feather")
display("Feather file loaded:")
display(df_loaded_feather)
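
 The `index=False` argument used when saving to CSV matters more than it may appear. As a sketch, the round trip below uses in-memory text buffers (`io.StringIO`) instead of files on disk, purely to keep the example self-contained; the DataFrame `staff` is illustrative.

```python
import io

import pandas as pd

staff = pd.DataFrame({"Name": ["Alice", "Bob"], "Salary": [70000, 80000]})

# Round trip without the index: the columns come back unchanged
buffer = io.StringIO()
staff.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# With the default index=True, the index is written as an unnamed first column
with_index = io.StringIO()
staff.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index_col=0 tells read_csv to use that first column as the index again
with_index.seek(0)
reloaded_fixed = pd.read_csv(with_index, index_col=0)

print(restored.columns.tolist())
print(reloaded.columns.tolist())
print(reloaded_fixed.columns.tolist())
```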


 ## 11. Exercises



 Practice what you have learned with the following exercises:



 1. Create a DataFrame from a dictionary of lists with at least three columns.

 2. Load a CSV file into a DataFrame and inspect its first few rows.

 3. Filter rows where a numeric column exceeds a certain threshold.

 4. Perform a group-by operation and calculate the mean of another column.

 5. Merge two DataFrames on a common key.



 **For each exercise, write your code in the provided cells.**

In [None]:
# 1. Create a DataFrame from a dictionary of lists.


In [None]:
# 2. Load a CSV file and inspect its first few rows.


In [None]:
# 3. Filter rows where a numeric column exceeds a threshold.


In [None]:
# 4. Perform a group-by operation and calculate the mean of another column.


In [None]:
# 5. Merge two DataFrames on a common key.
