# Pandas Data Wrangling Demo

This notebook demonstrates various pandas operations and data wrangling techniques, structured to match the LIVE DEMO sections from the lecture.

## 1. Basic Pandas Operations and Data Structures

In this section, we'll cover the fundamental pandas data structures (Series and DataFrame) and basic operations.

In [None]:
import pandas as pd
import numpy as np

# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series:")
print(s)

A pandas Series is a one-dimensional labeled array that can hold data of any type.
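To make the "labeled" part concrete, here is a small sketch with an explicit index. The day labels and temperature values are made up for illustration and are not part of the lecture data.

```python
import pandas as pd
import numpy as np

# Hypothetical labels and values, chosen only for illustration
temps = pd.Series([21.5, np.nan, 19.0], index=['mon', 'tue', 'wed'])

by_label = temps['wed']          # label-based access
by_position = temps.iloc[0]      # position-based access
n_missing = temps.isna().sum()   # number of NaN entries

print(by_label, by_position, n_missing)
```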

In [None]:
# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': ['p', 'q', 'r']
})
print("\nDataFrame:")
print(df)

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

In [None]:
# Basic operations
print("\nSelect column 'A':")
print(df['A'])

print("\nFilter rows where A > 1:")
print(df[df['A'] > 1])

We can select specific columns and filter rows based on conditions.
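Conditions can also be combined. The sketch below rebuilds the small DataFrame under the assumed name `df_demo` so it runs on its own, then filters rows and selects columns in one step with `.loc`.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['p', 'q', 'r']})

# Wrap each condition in parentheses; combine with & (and), | (or), ~ (not)
subset = df_demo.loc[(df_demo['A'] > 1) & (df_demo['B'] < 6), ['A', 'C']]
print(subset)
```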

In [None]:
# Add a new column
df['D'] = df['A'] * 2
print("\nDataFrame with new column 'D':")
print(df)

New columns can be added to a DataFrame by assigning values to a new column name.

In [None]:
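# Handle missing data
df['B'] = df['B'].astype('float64')  # NaN is a float, so cast explicitly instead of relying on implicit upcasting
df.loc[1, 'B'] = np.nan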
print("\nFill missing values with mean:")
print(df['B'].fillna(df['B'].mean()))

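Pandas provides methods to handle missing data, such as `fillna()` to replace NaN values. Note that `fillna()` returns a new object; the original column is unchanged unless we assign the result back.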
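A common first step is to count the missing values per column, then fill each column with a value that suits it. This is a minimal sketch; the frame `df_missing`, its columns, and the fill values are assumptions made for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical data with one gap in each column
df_missing = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': ['a', None, 'c']})

# Count missing values per column
missing_per_column = df_missing.isna().sum()

# Fill each column with its own replacement value
filled = df_missing.fillna({'x': df_missing['x'].median(), 'y': 'unknown'})

print(missing_per_column)
print(filled)
```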

In [None]:
# Convert data types
df['C'] = df['C'].astype('category')
print("\nData types:")
print(df.dtypes)

We can convert column data types using the `astype()` method.

## 2. Combining and Reshaping Data

This section covers techniques for combining multiple DataFrames and reshaping data.

In [None]:
# Create two DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Concatenate DataFrames
result = pd.concat([df1, df2])
print("Concatenated DataFrame:")
print(result)

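`pd.concat()` concatenates DataFrames along an axis (rows by default). The original row labels are kept, so the result above has the index 0, 1, 0, 1 unless we pass `ignore_index=True`.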
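The two most common variations are sketched here: renumbering the rows while stacking, and placing the frames side by side with `axis=1`. The `_2` suffix is an arbitrary choice to keep column names distinct.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Stack rows and renumber the index 0..3
stacked = pd.concat([df1, df2], ignore_index=True)

# Place frames side by side; rows are aligned on the index
side_by_side = pd.concat([df1, df2.add_suffix('_2')], axis=1)

print(stacked)
print(side_by_side)
```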

In [None]:
# Merge DataFrames
left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K0', 'K2'], 'B': ['B0', 'B2']})
merged = pd.merge(left, right, on='key', how='outer')
print("\nMerged DataFrame:")
print(merged)

`pd.merge()` joins DataFrames on a common key. With `how='outer'`, keys found in either DataFrame are kept, and cells without a match are filled with NaN.
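To see where each row of a merge came from, we can pass `indicator=True`, which adds a `_merge` column. The sketch below reuses the same `left` and `right` frames and compares an outer join with an inner join.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K0', 'K2'], 'B': ['B0', 'B2']})

# Outer join: keep all keys and record the source of each row
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

# Inner join: keep only keys present in both frames
inner = pd.merge(left, right, on='key', how='inner')

print(outer)
print(inner)
```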

In [None]:
# Reshape data: Melt
df = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 3], 'C': [2, 4]})
melted = pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
print("\nMelted DataFrame:")
print(melted)

`pd.melt()` is used to reshape data from wide to long format.
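By default the melted columns are called `variable` and `value`. A short sketch of renaming them with `var_name` and `value_name` follows; the names `measure` and `reading` are arbitrary placeholders.

```python
import pandas as pd

wide = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 3], 'C': [2, 4]})

# All columns other than id_vars are melted when value_vars is omitted
long = wide.melt(id_vars='A', var_name='measure', value_name='reading')
print(long)
```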

In [None]:
# Reshape data: Pivot
pivoted = melted.pivot(index='A', columns='variable', values='value')
print("\nPivoted DataFrame:")
print(pivoted)

`pivot()` reshapes data from long to wide format. It requires every index/column pair to be unique.
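When the long data contains repeated index/column pairs, `pivot()` raises a `ValueError`, and `pivot_table()` with an aggregation function is one way around it. The data below is invented to trigger that case.

```python
import pandas as pd

# Hypothetical long-format data with a repeated ('a', 'B') pair
long_dup = pd.DataFrame({
    'A': ['a', 'a', 'b'],
    'variable': ['B', 'B', 'B'],
    'value': [1, 2, 3],
})

try:
    long_dup.pivot(index='A', columns='variable', values='value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves the duplicates by aggregating them
aggregated = long_dup.pivot_table(index='A', columns='variable',
                                  values='value', aggfunc='mean')
print(pivot_failed)
print(aggregated)
```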

## 3. Data Cleaning Techniques

This section demonstrates various techniques for cleaning and preprocessing data.

In [None]:
# Create a DataFrame with issues
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, 6, 7, np.nan, 9],
    'C': ['a', 'b', 'a', 'c', 'b'],
    'D': pd.date_range(start='2023-01-01', periods=5)
})

# Handle missing data
print("Original DataFrame:")
print(df)
print("\nDropping rows with missing values:")
print(df.dropna())

By default, `dropna()` removes every row that contains at least one missing value.
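The `subset` and `how` parameters give finer control over which rows are dropped. This sketch uses only the two numeric columns from the example, under the assumed name `df_na`.

```python
import pandas as pd
import numpy as np

df_na = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, 6, 7, np.nan, 9],
})

any_missing = df_na.dropna()              # drop rows with any NaN
only_a = df_na.dropna(subset=['A'])       # consider column 'A' only
all_missing = df_na.dropna(how='all')     # drop rows where every value is NaN

print(len(any_missing), len(only_a), len(all_missing))
```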

In [None]:
# Remove duplicates
# Add a duplicate of the first row (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)
print("\nRemoving duplicates:")
print(df.drop_duplicates())

`drop_duplicates()` is used to remove duplicate rows from a DataFrame.
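Often rows are duplicates only with respect to some columns. A minimal sketch, assuming a made-up `id`/`score` table, keeps the last row per `id`:

```python
import pandas as pd

# Hypothetical data: id 1 appears twice with different scores
df_dup = pd.DataFrame({'id': [1, 1, 2], 'score': [10, 20, 30]})

# No row is an exact copy of another
n_exact = df_dup.duplicated().sum()

# Treat rows with the same 'id' as duplicates and keep the last one
last_per_id = df_dup.drop_duplicates(subset='id', keep='last')

print(n_exact)
print(last_per_id)
```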

In [None]:
# String operations
df['C'] = df['C'].str.upper()
print("\nAfter string operation:")
print(df)

String methods can be applied to string columns using the `str` accessor.
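Methods on the `str` accessor can be chained, and they pass missing values through rather than failing. The names below are invented for illustration.

```python
import pandas as pd

# Hypothetical messy text column with a missing entry
names = pd.Series(['  Alice ', 'bob', None])

# Strip whitespace, then capitalize each word; the missing value stays missing
cleaned = names.str.strip().str.title()

# na=False treats missing entries as "no match"
has_b = cleaned.str.contains('B', na=False)

print(cleaned)
print(has_b)
```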

In [None]:
# Date operations
df['Year'] = df['D'].dt.year
print("\nAfter extracting year:")
print(df)

Date and time operations can be performed using the `dt` accessor.
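Beyond the year, the `dt` accessor exposes other components. This sketch uses the same date range as column `D` and derives the weekday name and a weekend flag.

```python
import pandas as pd

dates = pd.Series(pd.date_range(start='2023-01-01', periods=5))

day_names = dates.dt.day_name()

# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday
is_weekend = dates.dt.dayofweek >= 5

print(day_names)
print(is_weekend)
```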

In [None]:
# Categorical data
df['C'] = df['C'].astype('category')
print("\nData types after conversion:")
print(df.dtypes)

Converting string columns with few distinct values to the `category` dtype can reduce memory usage and speed up operations such as grouping.
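The memory effect can be checked directly with `memory_usage(deep=True)`. The sketch repeats the five labels from column `C` 1000 times (an arbitrary factor) so the difference is visible; exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Many rows, only three distinct values
labels = pd.Series(['a', 'b', 'a', 'c', 'b'] * 1000)
as_category = labels.astype('category')

string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
print(as_category.cat.categories.tolist())
```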

## 4. Advanced Data Wrangling Techniques

This section covers more advanced data manipulation techniques.

In [None]:
# Custom function application
def celsius_to_fahrenheit(celsius):
    return (celsius * 9/5) + 32

df = pd.DataFrame({'Celsius': [0, 10, 20, 30]})
df['Fahrenheit'] = df['Celsius'].apply(celsius_to_fahrenheit)
print("After applying custom function:")
print(df)

The `apply()` method allows us to apply custom functions to DataFrame columns.
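For simple arithmetic like this conversion, the same result can be obtained with column-wise operations, which usually run faster than calling a Python function once per element. A sketch comparing the two:

```python
import pandas as pd

df_temp = pd.DataFrame({'Celsius': [0, 10, 20, 30]})

# Element-by-element with apply
via_apply = df_temp['Celsius'].apply(lambda c: c * 9 / 5 + 32)

# Vectorized arithmetic on the whole column
vectorized = df_temp['Celsius'] * 9 / 5 + 32

print(vectorized)
```

`apply()` remains useful when the logic cannot be expressed with built-in column operations.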

In [None]:
# Pivot table
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'two', 'one', 'two'],
    'C': [1, 2, 3, 4]
})
pivot = df.pivot_table(values='C', index='A', columns='B', aggfunc='sum')
print("\nPivot table:")
print(pivot)

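`pivot_table()` creates a spreadsheet-style pivot table as a DataFrame. Unlike `pivot()`, it aggregates the values (here with `sum`) when several rows share the same index/column pair.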
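In the example above every (A, B) pair occurs once, so nothing is actually aggregated. The sketch below changes the data (an assumption for illustration) so that one pair repeats and one is absent, and uses `fill_value` for the empty cell.

```python
import pandas as pd

# Hypothetical variation: ('foo', 'one') occurs twice, ('bar', 'two') never
sales = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'one'],
    'C': [1, 2, 3, 4],
})

table = sales.pivot_table(values='C', index='A', columns='B',
                          aggfunc='sum', fill_value=0)
print(table)
```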

In [None]:
# Data validation
valid_categories = ['foo', 'bar']
df['is_valid'] = df['A'].isin(valid_categories)
print("\nAfter data validation:")
print(df)

The `isin()` method is useful for checking if each element in a Series is contained in a list.
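The boolean result of `isin()` can also be negated with `~` to pull out the rows that fail validation. The value `'baz'` is an invented invalid entry.

```python
import pandas as pd

df_val = pd.DataFrame({'A': ['foo', 'bar', 'baz']})
valid_categories = ['foo', 'bar']

# Rows whose value is NOT in the allowed list
invalid_rows = df_val[~df_val['A'].isin(valid_categories)]
print(invalid_rows)
```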

In [None]:
# Binning data
df = pd.DataFrame({'value': [1, 5, 10, 15, 20]})
df['bin'] = pd.cut(df['value'], bins=[0, 5, 15, 20], labels=['low', 'medium', 'high'])
print("\nAfter binning:")
print(df)

The `cut()` function bins values into discrete intervals. Intervals are closed on the right by default, so here 5 falls in 'low' and 15 in 'medium'.
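Which edge of each interval is included matters for values that sit exactly on a boundary. This sketch bins the same values with the default right-closed intervals and with `right=False`.

```python
import pandas as pd

values = pd.Series([1, 5, 10, 15, 20])
bins = [0, 5, 15, 20]
labels = ['low', 'medium', 'high']

# Default: intervals (0, 5], (5, 15], (15, 20]
right_closed = pd.cut(values, bins=bins, labels=labels)

# right=False: intervals [0, 5), [5, 15), [15, 20); 20 falls outside and becomes NaN
left_closed = pd.cut(values, bins=bins, labels=labels, right=False)

print(right_closed)
print(left_closed)
```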