# Data Manipulation and Visualization with Python

Use this worksheet during the video lecture. For the sake of time, much of the dataframe construction is already provided. However, much of this notebook still requires you to fill it in while watching the lecture video!

In this notebook, we will cover data manipulation and visualization using Python. We will use the pandas library for data manipulation and the matplotlib and seaborn libraries for data visualization.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style for plots
sns.set(style="whitegrid")

# Reading in a Dataset and Gathering Basic Information

Let's start by reading in a CSV file and gathering basic information about the dataset.

In [None]:
# insert here 
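
As a rough sketch of what reading a CSV looks like: the lecture's file is not named in this worksheet, so the example below reads a small made-up CSV from an in-memory string instead. The variable names `csv_text` and `df_example` and the values are assumptions for illustration only. With a real file you would pass its path to `pd.read_csv()` instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical inline CSV text standing in for a file on disk
csv_text = """ecosystem,species_richness,pollution_level
Forest,120,Low
Desert,45,High
Wetland,80,Medium
"""

# pd.read_csv accepts a file path or any file-like object
df_example = pd.read_csv(io.StringIO(csv_text))

print(df_example.shape)   # (rows, columns)
print(df_example.dtypes)  # data type inferred for each column
```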

## Basic Information about the DataFrame

Here are some good ways to get basic information about a dataframe in Python:

- `head()`: Displays the first few rows of the dataframe.
- `tail()`: Displays the last few rows of the dataframe.
- `shape`: Returns the dimensions of the dataframe (number of rows and columns).
- `columns`: Returns the column names of the dataframe.
- `info()`: Displays the structure of the dataframe, including column names, data types, non-null counts, and memory usage.
- `describe()`: Provides summary statistics for each numeric column in the dataframe (by default).

In [None]:
# insert here 
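
The methods and attributes listed above can be tried on any dataframe. Here is a minimal sketch on a small made-up dataframe (`df_example` is a hypothetical name, not the lecture's dataset). Note that `shape` and `columns` are attributes, so they take no parentheses.

```python
import pandas as pd

# Small made-up dataframe for illustration
df_example = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30],
})

print(df_example.head(2))      # first 2 rows
print(df_example.tail(2))      # last 2 rows
print(df_example.shape)        # attribute: (number of rows, number of columns)
print(df_example.columns)      # attribute: column names
df_example.info()              # prints dtypes and non-null counts
print(df_example.describe())   # summary statistics for numeric columns
```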

# Data Manipulation with Pandas

We will now cover some basic data manipulation techniques using the pandas library.

In [None]:
# Create a sample "dictionary" object
data = {
    'name': ['Andie', 'Bridger', 'Scott'],
    'gender': ['Female', 'non-binary', 'Male'],
    'male': [False, False, True],
    'income_cat': ['middle', 'poor', 'rich'],
    'park_dist': [1.0, 0.5, 0.1]
}

data


In [None]:
# turn the dictionary into a pandas dataframe (insert here)
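
One possible way to do this is to pass the dictionary to `pd.DataFrame()`. Each key becomes a column name and each list becomes that column's values. The name `df_people` is an assumption for this sketch; the lecture may use a different one.

```python
import pandas as pd

# Same sample dictionary as above, repeated so this block runs on its own
data = {
    'name': ['Andie', 'Bridger', 'Scott'],
    'gender': ['Female', 'non-binary', 'Male'],
    'male': [False, False, True],
    'income_cat': ['middle', 'poor', 'rich'],
    'park_dist': [1.0, 0.5, 0.1]
}

# Keys become column names; lists become column values
df_people = pd.DataFrame(data)
print(df_people)
```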

## `assign()`

The `assign()` method can be used to add new columns or modify existing ones.

In [None]:
# insert here
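
A minimal sketch of `assign()`, assuming a dataframe like the one built from the sample dictionary. The new column name `near_park` and the 0.5 cutoff are made up for illustration. `assign()` returns a new dataframe and leaves the original unchanged, so the result is stored under a new name here.

```python
import pandas as pd

df_people = pd.DataFrame({
    'name': ['Andie', 'Bridger', 'Scott'],
    'park_dist': [1.0, 0.5, 0.1]
})

# Add a hypothetical boolean column; the lambda receives the dataframe itself
df_with_flag = df_people.assign(near_park=lambda d: d['park_dist'] < 0.5)

print(df_with_flag)
print(df_people.columns)  # the original dataframe has no 'near_park' column
```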

## `np.where()`

The `np.where()` function can be used to conditionally modify values in a dataframe.

In [None]:
# insert here
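
As an illustration, `np.where(condition, value_if_true, value_if_false)` evaluates the condition row by row and picks one of the two values. The column name `high_income` and the labels `'yes'` and `'no'` below are assumptions for this sketch.

```python
import numpy as np
import pandas as pd

df_people = pd.DataFrame({
    'name': ['Andie', 'Bridger', 'Scott'],
    'income_cat': ['middle', 'poor', 'rich']
})

# 'yes' where income_cat equals 'rich', otherwise 'no'
df_people['high_income'] = np.where(df_people['income_cat'] == 'rich', 'yes', 'no')

print(df_people)
```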

## Filter with conditional statements

Filtering rows in a dataframe can be done using boolean indexing.

In [None]:
# Filter rows where pollution_level is 'Low'
df_env_data = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High']
})

# insert here
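
One way to write this filter: a comparison such as `df['col'] == value` produces a boolean Series, and placing that Series inside square brackets keeps only the rows where it is `True`. The names `df_low` and `df_low_rich` and the cutoff of 100 are assumptions for this sketch.

```python
import pandas as pd

df_env_data = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High']
})

# Keep rows where pollution_level is 'Low'
df_low = df_env_data[df_env_data['pollution_level'] == 'Low']
print(df_low)

# Combine conditions with & (and) or | (or); each condition needs parentheses
df_low_rich = df_env_data[
    (df_env_data['pollution_level'] == 'Low') & (df_env_data['species_richness'] > 100)
]
print(df_low_rich)
```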

## `dropna()`

Dropping rows with missing values can be done using the `dropna()` method.

In [None]:
# Drop rows with missing values in the 'ecosystem' column
df_env_data_na = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', np.nan, 'Urban'],
    'species_richness': [120, 45, 80, 60, 30],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High']
})

# insert here
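
A possible solution uses the `subset` argument of `dropna()`, which restricts the missing-value check to the listed columns. Without `subset`, a row is dropped if any column is missing. The name `df_env_clean` is an assumption for this sketch.

```python
import numpy as np
import pandas as pd

df_env_data_na = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', np.nan, 'Urban'],
    'species_richness': [120, 45, 80, 60, 30],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High']
})

# Drop only the rows whose 'ecosystem' value is missing
df_env_clean = df_env_data_na.dropna(subset=['ecosystem'])
print(df_env_clean)
```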

## Select with `[]`

Selecting specific columns can be done using square bracket notation.

In [None]:
# insert here
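
A short sketch of column selection, reusing the environmental dataframe from earlier. A single column name in the brackets returns a Series, while a list of names (note the double brackets) returns a dataframe. The names `richness` and `df_subset` are assumptions for illustration.

```python
import pandas as pd

df_env_data = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High']
})

# One column name -> Series
richness = df_env_data['species_richness']

# List of column names -> DataFrame with just those columns
df_subset = df_env_data[['ecosystem', 'species_richness']]

print(type(richness))
print(df_subset)
```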

## `groupby()`

Grouping data and calculating aggregate statistics can be done using the `groupby()` method.

In [None]:
# Group by 'ecosystem' and calculate the mean species richness
df_env_long = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban', 'Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30, 110, 50, 85, 65, 35],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High', 'Low', 'High', 'Medium', 'Low', 'High']
})

# insert here
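
One way to compute this: group by `'ecosystem'`, select the `'species_richness'` column, and call `mean()`. The result is a Series indexed by ecosystem. The name `mean_richness` is an assumption for this sketch.

```python
import pandas as pd

df_env_long = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban',
                  'Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30, 110, 50, 85, 65, 35],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High',
                        'Low', 'High', 'Medium', 'Low', 'High']
})

# Mean species richness within each ecosystem
mean_richness = df_env_long.groupby('ecosystem')['species_richness'].mean()
print(mean_richness)
```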

## `agg()`

The `agg()` method can be used to apply multiple aggregation functions to grouped data.

In [None]:
# insert here
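
As a sketch, passing a list of function names to `agg()` produces one output column per function. Grouping by `'pollution_level'` and the choice of `mean`, `min`, and `max` are assumptions for illustration; the lecture may use different ones.

```python
import pandas as pd

df_env_long = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban',
                  'Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30, 110, 50, 85, 65, 35],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High',
                        'Low', 'High', 'Medium', 'Low', 'High']
})

# Several summary statistics of species richness per pollution level
summary = (
    df_env_long
    .groupby('pollution_level')['species_richness']
    .agg(['mean', 'min', 'max'])
)
print(summary)
```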

# Basic Data Visualization

Unlike R, where the ggplot2 package has become the dominant way to make plots, Python has many ways to visualize data. Below, we go through one example. We will end the lecture here today, but the code is provided below so that you have an example. You should explore it on your own!

In [None]:
# Histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['hwy'], bins=10, kde=False, color='black')
plt.title('Distribution of Highway Miles per Gallon')
plt.xlabel('Highway Miles per Gallon')
plt.ylabel('Count')
plt.show()

In [None]:
# Box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='cyl', y='hwy', data=df)
plt.title('Highway MPG Distribution by Cylinder Count')
plt.xlabel('Number of Cylinders')
plt.ylabel('Highway Miles per Gallon')
plt.show()

In [None]:
# Bar chart
plt.figure(figsize=(10, 6))
sns.countplot(x='manufacturer', data=df, color='black')
plt.title('Number of Observations by Manufacturer')
plt.xlabel('Manufacturer')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

In [None]:
# Scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='displ', y='hwy', data=df)
plt.title('Engine Displacement vs Highway MPG')
plt.xlabel('Engine Displacement (litres)')
plt.ylabel('Highway Miles per Gallon')
plt.show()

In [None]:
# Scatter plot with color grouping
plt.figure(figsize=(10, 6))
sns.scatterplot(x='displ', y='hwy', hue='manufacturer', data=df)
plt.title('Engine Displacement vs Highway MPG by Company')
plt.xlabel('Engine Displacement (litres)')
plt.ylabel('Highway Miles per Gallon')
plt.legend(title='Company')
plt.show()

In [None]:
# Facet plot
g = sns.FacetGrid(df, col='manufacturer', col_wrap=4, height=4)
g.map(sns.scatterplot, 'displ', 'hwy')
g.set_axis_labels('Engine Displacement (litres)', 'Highway Miles per Gallon')
g.fig.suptitle('Engine Displacement vs Highway MPG by Manufacturer', y=1.03)
plt.show()