#### Pandas Tutorial - Part 54

This notebook covers:
- Creating dummy variables with `str.get_dummies()`
- Plotting with pandas: bar and barh plots

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

##### Creating Dummy Variables with `str.get_dummies()`

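The `str.get_dummies()` method splits each string in a Series on a separator (`'|'` by default) and returns a DataFrame of dummy/indicator variables: one column per distinct token, holding 1 where the token appears in that row and 0 otherwise.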
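To make the mechanics concrete, here is a minimal sketch that rebuilds the same table by hand: split each string, explode to one token per row, then cross-tabulate row labels against tokens. The names `tokens`, `built_in`, and `manual` are illustrative only, and this is not how pandas implements the method internally.

```python
import pandas as pd

s = pd.Series(['a|b', 'a', 'a|c'])

# Built-in: the default separator is '|'
built_in = s.str.get_dummies()

# Manual equivalent: split into lists, then explode to one token per row
tokens = s.str.split('|').explode()

# Cross-tabulate the original row labels against the tokens
manual = pd.crosstab(tokens.index.to_numpy(), tokens.to_numpy())

print(built_in)
print(manual)
```

Both tables hold the same 0/1 values; the built-in call is shorter and also copes with missing values.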

In [None]:
# Create a Series with pipe-separated values
s = pd.Series(['a|b', 'a', 'a|c'])
print("Original Series:")
print(s)

In [None]:
# Convert to dummy variables
dummies = s.str.get_dummies()
print("Result of get_dummies():")
print(dummies)

In [None]:
# Create a Series with comma-separated values
s_comma = pd.Series(['red,blue', 'green', 'yellow,orange,purple', 'blue,green'])
print("Series with comma-separated values:")
print(s_comma)

In [None]:
# Convert to dummy variables with custom separator
dummies_comma = s_comma.str.get_dummies(sep=',')
print("Result of get_dummies(sep=','):")
print(dummies_comma)

In [None]:
# Create a Series with missing values
s_missing = pd.Series(['a|b', np.nan, 'a|c'])
print("Series with missing values:")
print(s_missing)

In [None]:
# Convert to dummy variables; the missing value becomes a row of zeros
dummies_missing = s_missing.str.get_dummies()
print("Result of get_dummies() with missing values:")
print(dummies_missing)
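Because a missing value turns into a row of zeros, it cannot be told apart from an empty string afterwards. If that distinction matters, one option is to record it in a separate indicator column, as in the sketch below. The column name `is_missing` is a hypothetical choice for illustration, not a pandas convention.

```python
import numpy as np
import pandas as pd

s_missing = pd.Series(['a|b', np.nan, 'a|c'])
dummies = s_missing.str.get_dummies()

# Record missingness explicitly, since the NaN row is all zeros otherwise.
# 'is_missing' is an illustrative column name.
dummies['is_missing'] = s_missing.isna().astype(int)

print(dummies)
```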

### Practical Example: Analyzing Movie Genres

In [None]:
# Create a DataFrame with movie data
movies = pd.DataFrame({
    'title': ['The Shawshank Redemption', 'The Godfather', 'Pulp Fiction', 'The Dark Knight', 'Forrest Gump'],
    'year': [1994, 1972, 1994, 2008, 1994],
    'genres': ['Drama', 'Crime|Drama', 'Crime|Drama|Thriller', 'Action|Crime|Drama|Thriller', 'Drama|Romance']
})
print("Movies DataFrame:")
print(movies)

In [None]:
# Convert genres to dummy variables
genre_dummies = movies['genres'].str.get_dummies(sep='|')
print("Genre dummy variables:")
print(genre_dummies)

In [None]:
# Combine the original DataFrame with the dummy variables
movies_with_dummies = pd.concat([movies, genre_dummies], axis=1)
print("Movies DataFrame with dummy variables:")
print(movies_with_dummies)

In [None]:
# Count the number of movies in each genre
genre_counts = genre_dummies.sum().sort_values(ascending=False)
print("Genre counts:")
print(genre_counts)
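The indicator columns are also handy for filtering and per-row summaries. The sketch below uses a shortened copy of the movie data so that it runs on its own; the names `mask`, `crime_thrillers`, and `genres_per_movie` are illustrative.

```python
import pandas as pd

# Shortened copy of the movie data, so the sketch is self-contained
movies = pd.DataFrame({
    'title': ['The Godfather', 'Pulp Fiction', 'Forrest Gump'],
    'genres': ['Crime|Drama', 'Crime|Drama|Thriller', 'Drama|Romance'],
})
genre_dummies = movies['genres'].str.get_dummies(sep='|')

# Boolean mask: movies tagged with both Crime and Thriller
mask = (genre_dummies['Crime'] == 1) & (genre_dummies['Thriller'] == 1)
crime_thrillers = movies.loc[mask, 'title'].tolist()

# Number of genres per movie is the row sum of the indicator columns
genres_per_movie = genre_dummies.sum(axis=1)

print(crime_thrillers)
print(genres_per_movie.tolist())
```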

##### Plotting with Pandas: Bar and Barh Plots

Pandas provides convenient plotting methods that are built on top of matplotlib. Let's explore bar and horizontal bar plots.
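Since the plotting methods sit on top of matplotlib, `plot.bar()` hands back a matplotlib `Axes` object that can be customized directly. The following is a small sketch of that idea. The `Agg` backend line is an assumption made so the code runs without a display; leave it out in a notebook that uses `%matplotlib inline`.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'lab': ['A', 'B', 'C'], 'val': [10, 30, 20]})

# plot.bar() returns the matplotlib Axes it drew on
ax = df.plot.bar(x='lab', y='val', rot=0)

# Axes methods can replace the plt.xlabel()/plt.ylabel() calls
ax.set_xlabel('Label')
ax.set_ylabel('Value')

# Each bar is a matplotlib Rectangle stored in ax.patches
bar_heights = [patch.get_height() for patch in ax.patches]
print(type(ax).__name__, bar_heights)

plt.close(ax.figure)
```

Working through `ax` rather than the global `plt` state makes it clearer which plot is being modified once a figure contains several Axes.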

### Bar Plots

In [None]:
# Create a simple DataFrame for plotting
df = pd.DataFrame({
    'lab': ['A', 'B', 'C'], 
    'val': [10, 30, 20]
})
print("Simple DataFrame:")
print(df)

In [None]:
# Create a basic bar plot
ax = df.plot.bar(x='lab', y='val', rot=0, title='Basic Bar Plot')
plt.show()

In [None]:
# Create a DataFrame with animal data
speed = [0.1, 17.5, 40, 48, 52, 69, 88]
lifespan = [2, 8, 70, 1.5, 25, 12, 28]
index = ['snail', 'pig', 'elephant', 'rabbit', 'giraffe', 'coyote', 'horse']
df_animals = pd.DataFrame({
    'speed': speed,
    'lifespan': lifespan
}, index=index)
print("Animals DataFrame:")
print(df_animals)

In [None]:
# Plot the entire DataFrame as a bar plot
ax = df_animals.plot.bar(rot=0, title='Animal Speed and Lifespan')
plt.xlabel('Animal')
plt.ylabel('Value')
plt.show()

In [None]:
# Plot a single column
ax = df_animals.plot.bar(y='speed', rot=0, title='Animal Speed', color='green')
plt.xlabel('Animal')
plt.ylabel('Speed')
plt.show()

In [None]:
# Plot with subplots
# With layout=(2, 1), pandas returns a 2D array of Axes, so flatten it before indexing
axes = df_animals.plot.bar(rot=0, subplots=True, figsize=(10, 8), layout=(2, 1)).ravel()
axes[0].set_title('Animal Speed')
axes[1].set_title('Animal Lifespan')
plt.tight_layout()
plt.show()
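One detail worth knowing about `subplots=True`: the shape of the returned array depends on whether `layout` is passed. Without it pandas returns a 1D array of Axes; with it, the array is reshaped to match the layout, so a single index yields another array rather than an Axes. The sketch below checks both shapes on a small made-up frame (`df_demo` is an illustrative name) and uses `ravel()`, which works in either case. The `Agg` backend line is an assumption for headless runs.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df_demo = pd.DataFrame(
    {'speed': [0.1, 17.5, 40], 'lifespan': [2, 8, 70]},
    index=['snail', 'pig', 'elephant'],
)

axes_1d = df_demo.plot.bar(subplots=True)                 # no layout given
axes_2d = df_demo.plot.bar(subplots=True, layout=(2, 1))  # layout given
print(axes_1d.shape, axes_2d.shape)

# ravel() yields a flat sequence of Axes in both cases
for ax, name in zip(axes_2d.ravel(), df_demo.columns):
    ax.set_ylabel(name)

plt.close('all')
```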

### Horizontal Bar Plots

In [None]:
# Create a basic horizontal bar plot
ax = df.plot.barh(x='lab', y='val', title='Basic Horizontal Bar Plot')
plt.show()

In [None]:
# Plot the entire DataFrame as a horizontal bar plot
ax = df_animals.plot.barh(title='Animal Speed and Lifespan')
plt.xlabel('Value')
plt.ylabel('Animal')
plt.show()

In [None]:
# Plot a single column
ax = df_animals.plot.barh(y='lifespan', title='Animal Lifespan', color='purple')
plt.xlabel('Lifespan (years)')
plt.ylabel('Animal')
plt.show()

In [None]:
# Plot with subplots
# With layout=(2, 1), pandas returns a 2D array of Axes, so flatten it before indexing
axes = df_animals.plot.barh(subplots=True, figsize=(10, 8), layout=(2, 1)).ravel()
axes[0].set_title('Animal Speed')
axes[1].set_title('Animal Lifespan')
plt.tight_layout()
plt.show()

### Customizing Bar Plots

In [None]:
# Customized bar plot
ax = df_animals['speed'].plot.bar(
    rot=45,
    title='Animal Speed (Customized)',
    figsize=(10, 6),
    color='skyblue',
    edgecolor='black',
    alpha=0.7
)
plt.xlabel('Animal', fontsize=12)
plt.ylabel('Speed (km/h)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
# Customized horizontal bar plot
ax = df_animals['lifespan'].plot.barh(
    title='Animal Lifespan (Customized)',
    figsize=(10, 6),
    color='salmon',
    edgecolor='black',
    alpha=0.7
)
plt.xlabel('Lifespan (years)', fontsize=12)
plt.ylabel('Animal', fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

### Practical Example: Visualizing Genre Counts

In [None]:
# Plot genre counts as a bar plot
ax = genre_counts.plot.bar(
    title='Movie Genre Counts',
    figsize=(10, 6),
    color='lightgreen',
    edgecolor='black',
    alpha=0.7
)
plt.xlabel('Genre', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
# Plot genre counts as a horizontal bar plot
ax = genre_counts.plot.barh(
    title='Movie Genre Counts',
    figsize=(10, 6),
    color='lightblue',
    edgecolor='black',
    alpha=0.7
)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
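A horizontal bar plot draws the first row at the bottom, so a Series sorted in descending order ends up with its largest bar at the bottom of the chart. If the largest bar should sit on top, one option is to sort ascending just before plotting, as sketched here. The counts are hard-coded to match the five-movie example so the snippet runs on its own, and the `Agg` backend line is an assumption for headless runs.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Counts hard-coded from the five-movie example above
genre_counts = pd.Series(
    {'Drama': 5, 'Crime': 3, 'Thriller': 2, 'Action': 1, 'Romance': 1}
)

# Sort ascending so the largest value is drawn last, i.e. at the top
ax = genre_counts.sort_values(ascending=True).plot.barh(title='Movie Genre Counts')
ax.set_xlabel('Count')
ax.set_ylabel('Genre')

bar_widths = [patch.get_width() for patch in ax.patches]
top_label = ax.get_yticklabels()[-1].get_text()
print(bar_widths, top_label)

plt.close(ax.figure)
```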

##### Conclusion

In this notebook, we've explored:

1. `str.get_dummies()`: converts delimiter-separated strings into dummy/indicator columns, which is useful for:
   - One-hot encoding of categorical variables
   - Handling multi-label data where each observation can belong to multiple categories
   - Preparing data for machine learning models

2. Pandas plotting capabilities, specifically bar and horizontal bar plots:
   - Creating basic bar plots with `plot.bar()`
   - Creating horizontal bar plots with `plot.barh()`
   - Customizing plots with various parameters
   - Creating subplots for multiple columns
   - Practical applications for data visualization

Together, `str.get_dummies()` and the `plot.bar()`/`plot.barh()` methods cover a common workflow: encode multi-label string data as indicator columns, aggregate the columns, and chart the result.
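As a closing comparison, the sketch below contrasts the top-level `pd.get_dummies()` function, which fits single-label columns, with `Series.str.get_dummies()`, which handles multi-label strings. The sample values and variable names are made up for illustration.

```python
import pandas as pd

single = pd.Series(['red', 'green', 'red'])      # one label per row
multi = pd.Series(['red|blue', 'green', 'red'])  # possibly several labels per row

# One-hot encoding: exactly one 1 per row
one_hot = pd.get_dummies(single, dtype=int)

# Multi-label encoding: a row may contain several 1s
multi_hot = multi.str.get_dummies(sep='|')

print(one_hot)
print(multi_hot)
```

In `one_hot` every row sums to 1, while rows of `multi_hot` can sum to more than 1.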