# Introduction to Pandas

Pandas is an open-source data analysis and manipulation library for Python. It provides the data structures and functions needed to work with structured (tabular) data. The primary data structures in pandas are:

- **Series**: A one-dimensional labeled array capable of holding any data type.
- **DataFrame**: A two-dimensional labeled data structure with columns of potentially different types, similar to a table in a relational database or a spreadsheet.
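To make the two structures concrete, here is a minimal sketch. The index labels `'a'`, `'b'`, `'c'` and the variable names `ages` and `people` are made up for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# A Series: one-dimensional, every value carries an index label.
ages = pd.Series([25, 30, 35], index=['a', 'b', 'c'], name='Age')

# A DataFrame: two-dimensional, every column is itself a Series.
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

print(ages['b'])             # look up a value by its label
print(type(people['Age']))   # a single column comes back as a Series
print(people.shape)          # (number of rows, number of columns)
```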

Pandas is widely used for data cleaning, transformation, and analysis, making it an essential tool in data science and machine learning.

## Operations Covered in This Notebook

In this notebook, we will explore several key operations using a `DataFrame`, which is one of the core data structures in pandas. Here’s a summary of what we will cover:

### Creating a DataFrame

- Learn how to create a `DataFrame` from a dictionary of lists or arrays.
- Understand the structure and content of the `DataFrame`.

### Accessing Columns

- Access individual columns by their labels.
- Work with column data for further analysis.

### Adding New Columns

- Add new columns to an existing `DataFrame`.
- Explore the impact of adding data to the `DataFrame`.

### Calculating Basic Statistics

- Use the `describe()` method to get summary statistics of numerical columns.
- Understand common statistical measures such as mean, standard deviation, and percentiles.

### Accessing Rows

- Retrieve specific rows using `loc` (label-based indexing) and `iloc` (integer position-based indexing).
- Learn to select and inspect row data.

### Filtering Data

- Filter rows based on conditions.
- Perform data selection based on logical criteria.

### Grouping Data

- Group data by one or more columns using the `groupby()` method.
- Compute aggregate functions like mean and sum for grouped data.

### Handling Missing Data

- Identify and handle missing values in the `DataFrame`.
- Use methods to drop or fill missing values.

### Saving Data

- Save the `DataFrame` to a CSV file for persistence.
- Understand the importance of saving data for later use.

### Reading Data

- Read data from a CSV file into a `DataFrame`.
- Explore how to load external data into pandas for analysis.


In [1]:
import pandas as pd

In [2]:
# create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print("DataFrame:\n", df)

DataFrame:
       Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [3]:
# access a column
print("\nAges:\n", df['Age'])


Ages:
 0    25
1    30
2    35
Name: Age, dtype: int64
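Column access comes in two flavors, and the difference is easy to miss. The sketch below rebuilds the same small table under a separate name (`demo`, an illustrative name) so it can run on its own: a single label returns a `Series`, while a list of labels returns a `DataFrame`.

```python
import pandas as pd

demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

age_series = demo['Age']           # one label -> Series
subset = demo[['Name', 'City']]    # list of labels -> DataFrame

print(type(age_series).__name__)
print(type(subset).__name__)
print(subset.columns.tolist())
```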


In [4]:
# adding a new column
df['Salary'] = [70000, 80000, 90000]
print("\nDataFrame with Salary:\n", df)



DataFrame with Salary:
       Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
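New columns can also be computed from existing ones. The following is a small sketch, assuming a hypothetical `SalaryK` column (salary in thousands) that is not part of the original notebook. It uses `assign()`, which returns a new `DataFrame` and leaves the original untouched.

```python
import pandas as pd

demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [70000, 80000, 90000],
})

# assign() builds a copy with the extra column; `demo` itself is unchanged.
demo_k = demo.assign(SalaryK=demo['Salary'] / 1000)

print(demo_k)
print('SalaryK' in demo.columns)   # the original has no such column
```

Direct assignment (`demo['SalaryK'] = ...`) modifies the `DataFrame` in place instead, which is what the cell above does for `Salary`.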


In [5]:
# calculating basic statistics
print("\nStatistics:\n", df.describe())



Statistics:
         Age   Salary
count   3.0      3.0
mean   30.0  80000.0
std     5.0  10000.0
min    25.0  70000.0
25%    27.5  75000.0
50%    30.0  80000.0
75%    32.5  85000.0
max    35.0  90000.0
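Each row of the `describe()` table can also be computed on its own. The sketch below does this for the `Age` values; note that pandas reports the sample standard deviation (`ddof=1`) by default, which is why `std` is 5.0 in the table above. The variable names are illustrative.

```python
import pandas as pd

demo_ages = pd.Series([25, 30, 35], name='Age')

mean_age = demo_ages.mean()
sample_std = demo_ages.std()              # ddof=1, the value describe() reports
population_std = demo_ages.std(ddof=0)    # divides by n instead of n - 1
first_quartile = demo_ages.quantile(0.25)

print(mean_age, sample_std, population_std, first_quartile)
```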


In [6]:
# accessing a row by its index label
print("\nRow 1:\n", df.loc[1])

# accessing a row by integer position (same row here, because the default index is 0, 1, 2)
print("\nRow 1 (using iloc):\n", df.iloc[1])



Row 1:
 Name              Bob
Age                30
City      Los Angeles
Salary          80000
Name: 1, dtype: object

Row 1 (using iloc):
 Name              Bob
Age                30
City      Los Angeles
Salary          80000
Name: 1, dtype: object
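The two lookups above return the same row only because the default index labels coincide with the positions. A minimal sketch with a made-up index (`10, 20, 30`, chosen purely for illustration) shows where `loc` and `iloc` diverge.

```python
import pandas as pd

demo = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=[10, 20, 30],   # hypothetical labels, no longer equal to positions
)

by_label = demo.loc[20]      # the row whose label is 20
by_position = demo.iloc[1]   # the second row, whatever its label

print(by_label['Name'], by_position['Name'])

# There is no row labeled 1, so label-based lookup fails here.
try:
    demo.loc[1]
except KeyError:
    print("no row is labeled 1")
```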


In [7]:
# filtering data
# df['Age'] > 30 builds a boolean mask; indexing with it keeps only the rows where the mask is True

filtered_df = df[df['Age'] > 30]
print("\nFiltered DataFrame:\n", filtered_df)



Filtered DataFrame:
       Name  Age     City  Salary
2  Charlie   35  Chicago   90000
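Conditions can be combined. The sketch below, using an illustrative copy of the table named `demo`, joins two boolean masks with `&` (and) and uses `isin()` for membership tests. Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparison operators in Python.

```python
import pandas as pd

demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Rows where Age is at least 30 AND City is not Chicago.
both = demo[(demo['Age'] >= 30) & (demo['City'] != 'Chicago')]

# Rows whose City is one of several allowed values.
coastal = demo[demo['City'].isin(['New York', 'Los Angeles'])]

print(both)
print(coastal)
```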


In [9]:
# group by 'City' and calculate the mean of the numeric columns (Age and Salary)
# numeric_only=True skips the text column 'Name', which recent pandas versions cannot average and would raise an error on
grouped_df = df.groupby('City').mean(numeric_only=True)
print("\nGrouped DataFrame:\n", grouped_df)



Grouped DataFrame:
               Age   Salary
City                      
Chicago      35.0  90000.0
Los Angeles  30.0  80000.0
New York     25.0  70000.0
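In the table above every city appears once, so each group holds a single row. Grouping becomes more informative when keys repeat. The following sketch uses made-up data (the `staff` table and its values are assumptions for illustration) and computes both the mean and the sum of `Salary` per city with `agg()`.

```python
import pandas as pd

# Hypothetical data in which each city occurs twice.
staff = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Salary': [70000, 80000, 90000, 60000],
})

# Select the column first, then apply several aggregations at once.
summary = staff.groupby('City')['Salary'].agg(['mean', 'sum'])
print(summary)
```

Selecting the column before aggregating avoids having to deal with non-numeric columns at all.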


In [10]:
# drop rows with missing values (this df has none, so the result is identical to df)
df_cleaned = df.dropna()
print("\nDataFrame after dropping missing values:\n", df_cleaned)



DataFrame after dropping missing values:
       Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
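Because the notebook's data has no gaps, `dropna()` returns it unchanged. To see identification, dropping, and filling in action, here is a minimal sketch on a made-up table with two missing entries. The table `gappy`, the fill value `'Unknown'`, and the choice to fill `Age` with the column mean are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with one missing Age and one missing City.
gappy = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 35],
    'City': ['New York', 'Los Angeles', None],
})

# Identify: count missing values per column.
missing_counts = gappy.isna().sum()

# Drop: keep only rows with no missing values.
dropped = gappy.dropna()

# Fill: supply a replacement value per column.
filled = gappy.fillna({'Age': gappy['Age'].mean(), 'City': 'Unknown'})

print(missing_counts)
print(dropped)
print(filled)
```

Whether to drop or fill depends on the analysis; filling with the mean is only one of several possible strategies.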


In [11]:
# saving DataFrame to CSV (index=False leaves the row index out of the file)
df.to_csv('data.csv', index=False)

In [12]:
# reading CSV file into DataFrame
df_read = pd.read_csv('data.csv')
print("\nDataFrame read from CSV:\n", df_read)



DataFrame read from CSV:
       Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
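The `index=False` argument matters for a clean round trip. The sketch below illustrates this with in-memory text via `io.StringIO` instead of a file on disk, which is a substitution made here so the example leaves no files behind. The variable names are illustrative.

```python
import io

import pandas as pd

demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Without a path, to_csv() returns the CSV text as a string.
csv_without_index = demo.to_csv(index=False)
csv_with_index = demo.to_csv()   # the default also writes the row index

round_trip = pd.read_csv(io.StringIO(csv_without_index))
with_extra_column = pd.read_csv(io.StringIO(csv_with_index))

print(round_trip.columns.tolist())
print(with_extra_column.columns.tolist())   # the index comes back as an unnamed column
```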
