<a href="https://colab.research.google.com/github/egynzhu-personal/siop-python-seminar-2024/blob/main/Pandas_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**2 Introduction to Pandas**

Pandas is a Python library for manipulating and analyzing tabular data. It provides labeled data structures and functions for loading, cleaning, reshaping, and summarizing structured data.

***2.1 Series and DataFrame***

Pandas has two main data structures. A Series is a one-dimensional labeled array, and a DataFrame is a two-dimensional table in which each column is a Series.

In [None]:
import pandas as pd

# Creating a Series
data = [1, 2, 3, 4, 5]
s = pd.Series(data)
s

In [None]:
# Creating a DataFrame
data = {'First Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Last Name': ['Clemons', 'Joseph', 'Bernard', 'Williams'],
        'Age': [25, 30, 35, 40],
        'Base Salary': [50000, 60000, 70000, 80000],
        'Bonus': [1000, 2000, 3000, 4000]}
df = pd.DataFrame(data)
df
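
To make the relationship between the two structures concrete, here is a minimal sketch on a small made-up frame (the variable names `people`, `ages`, and `labeled` are illustrative and not part of the seminar data). It shows that selecting one column of a DataFrame gives back a Series, and that a Series can carry its own labels.

```python
import pandas as pd

# Hypothetical toy data, mirroring two of the columns used above
people = pd.DataFrame({
    'First Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Selecting one column returns a Series that shares the DataFrame's index
ages = people['Age']
print(type(ages).__name__)               # Series
print(ages.index.equals(people.index))   # True

# A Series can use custom labels instead of the default 0..n-1 integers
labeled = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(labeled['Bob'])                    # 30

# shape reports (rows, columns) for a DataFrame
print(people.shape)                      # (3, 2)
```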

***2.2 Reading and Writing Data***

Pandas can read data from many sources, including CSV files, Excel workbooks, and SQL databases, and it can write a DataFrame back out to the same formats.

In [None]:
# Reading data from CSV
df = pd.read_csv('https://github.com/egynzhu-personal/siop-python-seminar-2024/blob/main/data/salary.csv?raw=true')

df.head(10)

In [None]:
# Writing data to CSV
df.to_csv('salary_copy.csv', index=False)
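
The read and write calls can also be tried without a network connection or a file on disk. The sketch below assumes a tiny made-up CSV string (`csv_text` is hypothetical, not the seminar's salary file) and uses `io.StringIO` as a stand-in for a file, to show what `index=False` does and that a write followed by a read gives back the same table.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV text standing in for a file on disk
csv_text = "First Name,Age,Base Salary\nAlice,25,50000\nBob,,60000\n"

# read_csv accepts any file-like object, not just paths or URLs
people = pd.read_csv(io.StringIO(csv_text))

# The empty Age field is read as NaN, which makes the column floating point
print(people.dtypes)

# Write to a buffer; index=False leaves the row labels out of the output
buffer = io.StringIO()
people.to_csv(buffer, index=False)

# Reading the written text back reproduces the original DataFrame
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip.equals(people))
```

Without `index=False`, the written file gains an extra unnamed first column holding the row labels, which then shows up as a data column on the next read.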

***2.3 Handling Missing Values***

Missing values are common in real-world datasets. Pandas represents them as NaN and provides functions to detect, fill, or drop them.

In [None]:
# Checking for missing values
df.isnull()

In [None]:
# Filling missing values with a specific value
# (fillna returns a new DataFrame; df itself is unchanged)
df.fillna(0)

In [None]:
# Filling missing values with each numeric column's mean, rounded to 1 decimal place
df.fillna(df.mean(numeric_only=True).round(1))

In [None]:
# Dropping rows with missing values
# (inplace=True modifies df itself instead of returning a copy)
df.dropna(inplace=True)
df
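
The cells above can be hard to read on a large table, so here is a minimal sketch on a made-up four-row frame (`toy` and its values are hypothetical). It counts the gaps per column, fills them with column means, and shows that `fillna` and `dropna` leave the original untouched unless `inplace=True` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    'Age': [25, np.nan, 35, 40],
    'Bonus': [1000, 2000, np.nan, 4000],
})

# isnull() gives a True/False table; summing it counts the gaps per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# fillna returns a new DataFrame; toy still contains its NaN values
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)

# dropna also returns a copy by default, keeping only fully populated rows
complete_rows = toy.dropna()
print(len(toy), len(complete_rows))  # 4 2
```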

***2.4 Data Manipulation***

Pandas supports selecting columns, filtering rows, deriving new columns, sorting, and aggregating by group.

In [None]:
# Selecting a single column (returns a Series)
df['Age']

In [None]:
# Selecting multiple columns with a list of names (returns a DataFrame)
df[['First Name', 'Base Salary']]

In [None]:
# Filtering rows based on a condition
df[df['Age'] > 30]

In [None]:
# Combining multiple conditions with & (each condition needs its own parentheses)
df[(df['Age'] > 30) & (df['Base Salary'] > 60000)]
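
The filters above use `&` for "and". As a minimal sketch on a made-up frame (`staff` is hypothetical), the same pattern extends to `|` for "or" and `~` for "not", and `.loc` can filter rows and pick columns in a single step.

```python
import pandas as pd

# Hypothetical toy data, mirroring columns used above
staff = pd.DataFrame({
    'First Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Base Salary': [50000, 60000, 70000, 80000],
})

# Each comparison yields a boolean Series; | keeps rows where either is True
older_or_well_paid = staff[(staff['Age'] > 35) | (staff['Base Salary'] > 60000)]
print(older_or_well_paid)

# ~ inverts a boolean Series
not_over_30 = staff[~(staff['Age'] > 30)]
print(not_over_30)

# .loc takes a row condition and a list of columns together
names = staff.loc[staff['Age'] > 30, ['First Name', 'Base Salary']]
print(names)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python, and the keywords `and` and `or` do not work element-wise on a Series.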

In [None]:
# Creating a new column based on a condition
df['High Salary'] = df['Base Salary'] > 60000

In [None]:
# Applying functions to columns
df['Full Name'] = df[['First Name', 'Last Name']].agg(' '.join, axis=1)

# Simple mathematical operations can be applied directly to columns
df['Total Compensation'] = df['Base Salary'] + df['Bonus']
df
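
As an alternative sketch of the same idea, string columns can be joined with `+` directly, and `assign` adds a derived column without modifying the original frame. The frame `staff` and the column name `bonus_share` below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, mirroring columns used above
staff = pd.DataFrame({
    'First Name': ['Alice', 'Bob'],
    'Last Name': ['Clemons', 'Joseph'],
    'Base Salary': [50000, 60000],
    'Bonus': [1000, 2000],
})

# String columns support + for element-wise concatenation
staff['Full Name'] = staff['First Name'] + ' ' + staff['Last Name']

# assign returns a new DataFrame with the extra column; staff is unchanged
with_share = staff.assign(
    bonus_share=lambda d: d['Bonus'] / (d['Base Salary'] + d['Bonus'])
)
print(with_share)
```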

In [None]:
# Setting a column as the index (the column itself stays in the DataFrame)
df.index = df['Full Name']
df

In [None]:
# Dropping a column
df.drop('Full Name', axis=1, inplace=True)
df

In [None]:
# Sorting DataFrame by a column
df_sorted = df.sort_values(by='Total Compensation', ascending=False)
df_sorted

In [None]:
# Grouping and aggregation
# Group rows by a column, then compute the mean of every numeric column per group
grouped_data = df.groupby('High Salary').mean(numeric_only=True)
grouped_data
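
When a single mean per column is not enough, `agg` can compute several named statistics per group. The following is a minimal sketch on a made-up frame (`staff` and the output column names `headcount`, `mean_salary`, and `max_bonus` are hypothetical).

```python
import pandas as pd

# Hypothetical toy data, mirroring columns used above
staff = pd.DataFrame({
    'Base Salary': [50000, 60000, 70000, 80000],
    'Bonus': [1000, 2000, 3000, 4000],
})
staff['High Salary'] = staff['Base Salary'] > 60000

# Each keyword names an output column as (input column, aggregation)
summary = staff.groupby('High Salary').agg(
    headcount=('Base Salary', 'count'),
    mean_salary=('Base Salary', 'mean'),
    max_bonus=('Bonus', 'max'),
)
print(summary)
```

The result has one row per group, indexed by the values of the grouping column (here `False` and `True`).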

**Activity**: Read the data from the CSV file named 'sales_data.csv' using the following line.
```
pd.read_csv('https://github.com/egynzhu-personal/siop-python-seminar-2024/blob/main/data/sales_data.csv?raw=true')
```
Then perform the following operations:

1. Display the first 5 rows of the DataFrame.
2. Impute missing data with row means (axis=1).
3. Create a new column named "Total Sales" and calculate the total sales for each row.
4. Filter for rows with total sales > 250 and January sales > 0.
5. Group the data by the "Region" column and calculate the average sales for each region.
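
Step 2 is the one operation not demonstrated earlier, so here is a hint as a minimal sketch on a made-up frame. The column names `Jan`, `Feb`, and `Mar` are assumptions for illustration and may not match the columns in `sales_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real file's column names may differ
toy = pd.DataFrame({
    'Region': ['North', 'South'],
    'Jan': [100.0, np.nan],
    'Feb': [np.nan, 80.0],
    'Mar': [200.0, 120.0],
})

# Restrict the imputation to the numeric sales columns
month_cols = ['Jan', 'Feb', 'Mar']

# With axis=1 the function receives one row at a time as a Series;
# row.mean() skips NaN, so each gap is filled with that row's own mean
toy[month_cols] = toy[month_cols].apply(
    lambda row: row.fillna(row.mean()), axis=1
)
print(toy)
```

Row-wise `apply` is used here because passing a Series to `DataFrame.fillna` matches its labels against columns, not rows.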

In [None]:
# Activity
df = pd.read_csv('https://github.com/egynzhu-personal/siop-python-seminar-2024/blob/main/data/sales_data.csv?raw=true')

Pandas is an essential library for data analysis and manipulation in Python. It simplifies many data-related tasks and provides powerful tools for working with structured data.