# Introduction to Pandas for Data Science

Welcome to this tutorial on Pandas, a powerful library for data manipulation and analysis in Python. In this notebook, we'll explore how Pandas can be used in statistical analysis and data science, covering the basics from data loading to essential operations.

## What is Pandas?

Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of NumPy and integrates well with other libraries such as Matplotlib and SciPy.

## Why Use Pandas in Statistics?

- **Data Handling**: Efficiently read and write data from various formats (CSV, Excel, SQL databases).
- **Data Manipulation**: Clean, transform, and merge datasets with ease.
- **Statistical Analysis**: Compute descriptive statistics and perform operations like grouping and aggregation.
- **Visualization**: Integrate with plotting libraries to visualize data trends.

---

# Getting Started

Let's begin by importing the Pandas library.

In [None]:
import pandas as pd

# Pandas Data Structures

Pandas introduces two primary data structures:

- **Series**: One-dimensional labeled array capable of holding any data type.
- **DataFrame**: Two-dimensional labeled data structure with columns of potentially different types.

## Creating a Series

In [None]:
# Creating a Series from a list
import pandas as pd

data = [10, 20, 30, 40]
s = pd.Series(data)
print(s)
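
Because a Series is a labeled array, its values can be looked up by label as well as by position. The following is a minimal sketch; the labels `'a'` through `'d'` and the name `s_labeled` are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical labels chosen for illustration
s_labeled = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(s_labeled['c'])       # Look up a value by its label
print(s_labeled.iloc[2])    # Look up the same value by its position
```

When no `index` is passed, Pandas falls back to integer labels starting at 0, which is what the first example shows.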

## Creating a DataFrame

In [None]:
# Creating a DataFrame from a dictionary
data = {
    'Age': [25, 30, 35, 40],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
}
df = pd.DataFrame(data)
print(df)

---

# Reading Data from Files

Pandas can read data from various file formats. The most common is CSV.

## Reading CSV Files

CSV (Comma-Separated Values) is a plain-text format for tabular data. Pandas reads it with the `read_csv` function.

In [None]:
# Reading data from a CSV file
# ('data.csv' is a placeholder; replace it with the path to your own file)
df_csv = pd.read_csv('data.csv')
print(df_csv.head())

## Reading Compressed CSV Files

Pandas can directly read compressed files (e.g., `.gz`, `.bz2`, `.zip`, `.xz`). If your data is in a compressed CSV file, you can read it without decompressing it first. Pandas will automatically detect the compression based on the file extension.

In [None]:
# Reading data from a compressed CSV file
df_csv_gz = pd.read_csv('data.csv.gz')
print(df_csv_gz.head())
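
To try this without an existing file, one option is to write a small compressed CSV first and read it back. This is only a sketch: the table `demo` and the file name `demo.csv.gz` are invented for the demonstration, and the file lives in a temporary directory that is deleted afterwards.

```python
import os
import tempfile

import pandas as pd

# Small made-up table used only for this demonstration
demo = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.csv.gz')
    demo.to_csv(path, index=False)   # gzip compression inferred from '.gz'
    restored = pd.read_csv(path)     # Decompressed automatically on read

print(restored)
```

The same extension-based inference applies in both directions, so neither call needs an explicit `compression` argument here.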

## Example Dataset

For this tutorial, we'll use a sample dataset about students' scores.

In [None]:
# Sample data
data = {
    'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Math': [85, 78, 90, 95, 88],
    'English': [92, 76, 88, 85, 91],
    'Science': [89, 80, 86, 94, 90]
}
df = pd.DataFrame(data)
print(df)

---

# Saving Data

After manipulating data, you might want to save it back to a file.

In [None]:
# Saving to CSV
df.to_csv('updated_data.csv', index=False)
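
The `index=False` argument keeps the row labels out of the file. A quick way to see the difference is to call `to_csv` without a path, in which case it returns the CSV text instead of writing to disk. The small frame `scores_demo` below is a made-up stand-in for the tutorial data.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
scores_demo = pd.DataFrame({'Student': ['Alice', 'Bob'], 'Math': [85, 78]})

# With no path, to_csv returns the CSV text as a string
with_index = scores_demo.to_csv()
without_index = scores_demo.to_csv(index=False)

print(with_index)      # First column holds the row labels 0, 1
print(without_index)   # Only the data columns are written
```

Leaving the index out is usually what you want when the labels are just the default 0, 1, 2, ... and carry no information.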

---

# Basic Data Exploration

## Viewing Data

- **Head and Tail**: View the first or last few rows.

In [None]:
print(df.head())    # First 5 rows
print(df.tail(3))   # Last 3 rows

## Summary Statistics

In [None]:
print(df.describe())
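
`describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column. The individual statistics are also available as separate methods, as in this short sketch (the Series `math_demo` simply repeats the Math scores from the sample data):

```python
import pandas as pd

# Math scores copied from the sample dataset
math_demo = pd.Series([85, 78, 90, 95, 88], name='Math')

print(math_demo.mean())     # Arithmetic mean
print(math_demo.median())   # Middle value of the sorted scores
print(math_demo.std())      # Sample standard deviation (ddof=1 by default)
```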

## Data Types

In [None]:
print(df.dtypes)

---

# Data Selection and Indexing

## Selecting Columns

In [None]:
# Selecting a single column
math_scores = df['Math']
print(math_scores)

## Selecting Multiple Columns

In [None]:
# Selecting multiple columns
scores = df[['Student', 'Math', 'Science']]
print(scores)

## Row Selection

### By Label with `.loc`

In [None]:
# Selecting rows by label (with the default index, the labels are 0, 1, 2, ...)
row = df.loc[2]
print(row)

### By Position with `.iloc`

In [None]:
# Selecting rows by integer location
row = df.iloc[2]
print(row)
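
With the default integer index, `.loc[2]` and `.iloc[2]` return the same row, which can hide the difference between them. The sketch below uses a hypothetical frame, `by_name`, indexed by student name to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame indexed by student name instead of 0..n-1
by_name = pd.DataFrame(
    {'Math': [85, 78, 90]},
    index=['Alice', 'Bob', 'Charlie'],
)

print(by_name.loc['Bob'])   # Row whose label is 'Bob'
print(by_name.iloc[1])      # Row at position 1 (the same row here)
```

Here `by_name.loc[1]` would raise a `KeyError`, because 1 is a position and not one of the labels.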

## Conditional Selection

In [None]:
# Students with Math score above 85
high_math = df[df['Math'] > 85]
print(high_math)
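
Conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because of operator precedence. A minimal sketch on a copy of the sample scores, here called `df_demo`:

```python
import pandas as pd

# Subset of the sample dataset, rebuilt so the cell runs on its own
df_demo = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Math': [85, 78, 90, 95, 88],
    'English': [92, 76, 88, 85, 91],
})

# Students scoring above 85 in both Math and English
strong_both = df_demo[(df_demo['Math'] > 85) & (df_demo['English'] > 85)]
print(strong_both)
```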

---

# Data Manipulation

## Adding New Columns

In [None]:
# Calculating the average score
df['Average'] = df[['Math', 'English', 'Science']].mean(axis=1)
print(df)

## Renaming Columns

In [None]:
# Renaming the 'Student' column to 'Name'
df.rename(columns={'Student': 'Name'}, inplace=True)
print(df)

## Handling Missing Data

In [None]:
# Let's introduce some missing data
df.loc[2, 'Science'] = None

# Checking for null values
print(df.isnull())

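# Filling missing values with the column mean
# (assign the result back; calling fillna with inplace=True on a selected
# column is deprecated in recent pandas versions and may not update df)
df['Science'] = df['Science'].fillna(df['Science'].mean())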
print(df)
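
Filling is not the only option. When an estimated value would be misleading, the affected rows can be dropped instead. The following sketch uses a small hypothetical frame, `science_demo`, with one missing score.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Science score
science_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Science': [89.0, 80.0, np.nan],
})

print(science_demo['Science'].isnull().sum())   # Number of missing values

# dropna returns a new DataFrame and leaves the original unchanged
complete_rows = science_demo.dropna(subset=['Science'])
print(complete_rows)
```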

---

# Grouping and Aggregating Data

## Group By

Suppose we have an additional column for class section.

In [None]:
df['Section'] = ['A', 'B', 'A', 'B', 'A']
print(df)

### Calculating Grouped Statistics

In [None]:
# Average scores by section
# (numeric_only=True skips the text 'Name' column, which cannot be averaged)
grouped = df.groupby('Section').mean(numeric_only=True)
print(grouped)
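
A group can also be summarized with several statistics at once by passing a list of function names to `agg`. This is a sketch on a reduced copy of the data, named `sections_demo` here, with only the Section and Math columns.

```python
import pandas as pd

# Section and Math columns copied from the tutorial data
sections_demo = pd.DataFrame({
    'Section': ['A', 'B', 'A', 'B', 'A'],
    'Math': [85, 78, 90, 95, 88],
})

# One row per section, one column per aggregation
math_summary = sections_demo.groupby('Section')['Math'].agg(['mean', 'min', 'max'])
print(math_summary)
```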

---

# Data Visualization with Pandas

Pandas integrates well with Matplotlib for plotting.

In [None]:
import matplotlib.pyplot as plt

# Plotting the average scores
df.plot(x='Name', y='Average', kind='bar')
plt.ylabel('Average Score')
plt.title('Average Scores of Students')
plt.show()

---

# Practical Example: Analyzing a Dataset

Let's work through a practical example using a real dataset.

## Loading the Titanic Dataset

The Titanic dataset is a classic in data science.

In [None]:
# Load the dataset through seaborn
# (it is downloaded on first use, so an internet connection is needed)
import seaborn as sns

titanic = sns.load_dataset('titanic')
print(titanic.head())

## Exploring the Data

In [None]:
# Basic info (info() prints its report itself and returns None,
# so it should not be wrapped in print())
titanic.info()

# Statistical summary
print(titanic.describe())

## Cleaning the Data

In [None]:
# Drop unnecessary columns
titanic.drop(columns=['deck', 'embark_town'], inplace=True)

# Fill missing age values with the median
# (assign the result back instead of using inplace=True on a selected column)
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

## Analysis: Survival Rate by Gender

In [None]:
# Survival rate by gender
survival_by_gender = titanic.groupby('sex')['survived'].mean()
print(survival_by_gender)

## Visualization

In [None]:
# Bar plot of survival rate by gender
survival_by_gender.plot(kind='bar')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Gender on Titanic')
plt.show()

---

# Conclusion

In this tutorial, we've covered:

- The basics of Pandas data structures: Series and DataFrame.
- Reading data from files, including CSV and compressed CSV files.
- Exploring and summarizing data.
- Data selection, indexing, and manipulation.
- Grouping and aggregating data.
- Basic data visualization.

Pandas is an essential tool in a data scientist's toolkit, enabling efficient data analysis and manipulation. As you progress, you'll discover more advanced features and integrations with other libraries.

# Additional Resources

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [10 Minutes to Pandas](https://pandas.pydata.org/docs/user_guide/10min.html)

Feel free to experiment with the code and explore datasets relevant to your interests!