
# Section 4 - Python example: data manipulation with pandas

Pandas is the core library for tabular data manipulation in Python, with functions that cover everything from simple aggregation to complex transformations. This section walks through a Python example of the most common techniques: indexing, handling missing data, filtering, aggregating, grouping, and merging datasets. All of these are routine steps in preparing data for analysis or machine learning.

1. Setting Up the Environment:

To follow along, make sure Pandas is installed in your Python environment. If it is not, install it with pip (the leading % runs pip inside the notebook's own environment):

In [None]:
%pip install pandas

2. Importing Pandas:

Start by importing Pandas, conventionally under the alias pd:

In [None]:
import pandas as pd

3. Creating and Loading Data:

For this example, let’s create a simple DataFrame. In practice, data might be loaded from a CSV file, SQL database, or another source:

In [None]:
# Creating a DataFrame
data = pd.DataFrame({ 'Name': ['John', 'Anna', 'Bob', 'Linda'], 'Age': [28, 22, 34, 29], 'Salary': [50000, 62000, 55000, 48000] })
print(data)
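
As a minimal sketch of the CSV route mentioned above, the same table can be read with `pd.read_csv`. The CSV text here is a hypothetical stand-in for a file on disk, wrapped in `io.StringIO` so the example runs without any external file:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file such as "employees.csv".
csv_text = """Name,Age,Salary
John,28,50000
Anna,22,62000
Bob,34,55000
Linda,29,48000
"""

# pd.read_csv accepts a file path or any file-like object.
loaded = pd.read_csv(io.StringIO(csv_text))

print(loaded.dtypes)  # column types inferred from the text
print(loaded.shape)   # (rows, columns)
```

With a real file, the call would take the path directly, for example `pd.read_csv('employees.csv')`, where the filename is again only an illustration.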

4. Basic Data Manipulation Techniques:

**Indexing and Selecting Data:**

Pandas offers multiple methods for indexing and selecting data, which are crucial for slicing data into manageable parts:

In [None]:
# Selecting a column
ages = data['Age']
print(ages)
# Selecting multiple columns
subset = data[['Name', 'Salary']]
print(subset)
# Conditional selections
over_30 = data[data['Age'] > 30]
print(over_30)
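
Beyond bracket selection, Pandas also provides the `.loc` (label-based) and `.iloc` (position-based) indexers. The sketch below rebuilds the example DataFrame so it can run on its own; the variable names are illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['John', 'Anna', 'Bob', 'Linda'],
    'Age': [28, 22, 34, 29],
    'Salary': [50000, 62000, 55000, 48000],
})

# Label-based: the row with index label 1, column 'Name'
name_at_1 = data.loc[1, 'Name']

# Position-based: first two rows and first two columns
top_left = data.iloc[:2, :2]

# Boolean mask and column selection in a single .loc call
names_over_25 = data.loc[data['Age'] > 25, 'Name']

print(name_at_1)
print(top_left)
print(names_over_25.tolist())
```

Using `.loc` with a mask and a column label in one call avoids chaining two separate selections, which matters when the result is later assigned to.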

**Handling Missing Data:**

It’s common to encounter missing values in datasets. Pandas provides several methods to handle missing data effectively:

In [None]:
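# Cast Salary to float so the column can hold NaN, then introduce a missing value
data['Salary'] = data['Salary'].astype(float)
data.loc[2, 'Salary'] = None
# Fill missing values with the column mean (assign back instead of using inplace=True on a column)
data['Salary'] = data['Salary'].fillna(data['Salary'].mean())
print(data)
# Dropping rows with any missing values (a no-op here, since the gap was filled above)
clean_data = data.dropna()
print(clean_data)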
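
Before filling or dropping, it usually helps to see where the gaps are. The following sketch uses a hypothetical DataFrame with gaps in two columns. It counts missing values per column, fills each column with its own statistic, and drops rows only when a chosen column is missing:

```python
import pandas as pd

# Hypothetical frame with one gap in Age and one in Salary
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Bob', 'Linda'],
    'Age': [28.0, None, 34.0, 29.0],
    'Salary': [50000.0, 62000.0, None, 48000.0],
})

# Count missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Fill each column with its own statistic by passing a dict to fillna
filled = df.fillna({
    'Age': df['Age'].median(),
    'Salary': df['Salary'].mean(),
})
print(filled)

# Drop rows only when Salary is missing; gaps in other columns are kept
has_salary = df.dropna(subset=['Salary'])
print(has_salary)
```

Whether to fill with a mean, a median, or to drop the row depends on the dataset; the choices here are only for illustration.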

**Data Filtering:**

Filtering involves specifying conditions to isolate subsets of data:

In [None]:
# Keep only the rows where Salary exceeds 50,000
high_earners = data[data['Salary'] > 50000]
print(high_earners)
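
Conditions can also be combined. The sketch below, which rebuilds the original example data so it runs on its own, shows three common forms: boolean operators, `isin`, and `query`:

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['John', 'Anna', 'Bob', 'Linda'],
    'Age': [28, 22, 34, 29],
    'Salary': [50000, 62000, 55000, 48000],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
young_high = data[(data['Age'] < 30) & (data['Salary'] > 49000)]

# Match against a list of allowed values
selected = data[data['Name'].isin(['Anna', 'Bob'])]

# The same combined filter written as a query string
same_as_first = data.query('Age < 30 and Salary > 49000')

print(young_high)
print(selected)
```

The parentheses are required because `&` binds more tightly than the comparison operators in Python.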

**Data Aggregation:**

Aggregation reduces a column, or a whole DataFrame, to summary statistics such as the mean, sum, or count:

In [None]:
# Aggregating data
average_salary = data['Salary'].mean()
print("Average Salary:", average_salary)
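
Several statistics can be computed in one call with `agg`. This sketch rebuilds the original example data (before any missing values were introduced) so it stands alone:

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['John', 'Anna', 'Bob', 'Linda'],
    'Age': [28, 22, 34, 29],
    'Salary': [50000, 62000, 55000, 48000],
})

# Several statistics for one column
summary = data['Salary'].agg(['min', 'max', 'mean'])
print(summary)

# A different statistic for each column
per_column = data.agg({'Age': 'mean', 'Salary': 'sum'})
print(per_column)
```

For a quick overview of all numeric columns, `data.describe()` reports count, mean, standard deviation, and quartiles in a single table.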

5. Advanced Data Manipulation Techniques:

**Grouping Data:**

Grouping involves segmenting data into groups and applying a function to each group:

In [None]:
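# Grouping by a column and summing Salary within each group
# (every Age in this toy dataset is unique, so each group holds a single row)
grouped_data = data.groupby('Age')['Salary'].sum()
print(grouped_data)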
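
Grouping is easier to see when the key repeats. The sketch below adds a hypothetical Department column, which is not part of the original dataset, so that each group contains more than one row:

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['John', 'Anna', 'Bob', 'Linda'],
    # Hypothetical column, added only to give the groups more than one member
    'Department': ['Sales', 'Engineering', 'Sales', 'Engineering'],
    'Salary': [50000, 62000, 55000, 48000],
})

# Split by Department, then compute the mean salary and headcount per group
by_dept = staff.groupby('Department')['Salary'].agg(['mean', 'count'])
print(by_dept)
```

Selecting the Salary column before aggregating keeps non-numeric columns such as Name out of the calculation.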

**Merging Data:**

Combining datasets is a common operation, especially when dealing with relational data:

In [None]:
# Additional DataFrame
new_data = pd.DataFrame({ 'Name': ['John', 'Anna'], 'Bonus': [3000, 1500] })
# Merging DataFrames
merged_data = pd.merge(data, new_data, on='Name', how='left')
print(merged_data)
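
The `how` argument controls which rows survive a merge. As a sketch on the same two tables, rebuilt here so the example runs on its own, a left join keeps every employee while an inner join keeps only those with a bonus record. The `indicator=True` option adds a `_merge` column showing where each row came from:

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['John', 'Anna', 'Bob', 'Linda'],
    'Age': [28, 22, 34, 29],
    'Salary': [50000, 62000, 55000, 48000],
})
new_data = pd.DataFrame({'Name': ['John', 'Anna'], 'Bonus': [3000, 1500]})

# Left join: keep all rows of data; Bonus is NaN where no match exists
left = pd.merge(data, new_data, on='Name', how='left', indicator=True)

# Inner join: keep only names present in both tables
inner = pd.merge(data, new_data, on='Name', how='inner')

# Names that had no bonus record
unmatched = left.loc[left['_merge'] == 'left_only', 'Name'].tolist()

print(left)
print(inner)
print(unmatched)
```

Checking the `_merge` column after a join is a cheap way to catch keys that failed to match, for example because of spelling differences.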

These examples cover the main building blocks of data manipulation in Pandas, from basic operations like selection and filtering to more involved ones like grouping and merging. With these techniques, analysts and data scientists can clean and reshape data efficiently before deeper analysis or predictive modeling. However, if Pandas is so good, why do we need NumPy?