# Data Manipulation and Visualization with Python

In this notebook, we will cover data manipulation and visualization using Python. We will use the pandas library for data manipulation and the matplotlib and seaborn libraries for data visualization.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style for plots
sns.set_theme(style="whitegrid")

# Data Manipulation with Pandas

We will now cover some basic data manipulation techniques using the pandas library.

In [None]:
# Build sample data
df_env = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High']
})

df_env


## `select()`

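Current versions of pandas have no `DataFrame.select()` method; this heading borrows the name from `select()` in R's dplyr. In pandas, specific columns are selected with bracket indexing (`df[['a', 'b']]`) or with the `loc` (label-based) and `iloc` (position-based) indexers. Note that `loc` and `iloc` are used with square brackets rather than called like methods.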
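
As a minimal sketch of the three common ways to pick columns, here is a small made-up dataframe (the name `df` and its values are just for illustration, not the exercise answer):

```python
import pandas as pd

# Small made-up dataframe mirroring the column names used in this notebook
df = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland'],
    'species_richness': [120, 45, 80],
    'pollution_level': ['Low', 'High', 'Medium']
})

# Bracket indexing with a list of column names returns a new DataFrame
by_brackets = df[['ecosystem', 'pollution_level']]

# loc selects by label: all rows (:) and the named columns
by_label = df.loc[:, ['ecosystem', 'pollution_level']]

# iloc selects by integer position: all rows, columns 0 and 2
by_position = df.iloc[:, [0, 2]]
```

All three produce the same two-column dataframe here. Bracket indexing is usually the shortest, while `loc` and `iloc` are handy when you also want to subset rows.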

In [None]:
# Select only the 'ecosystem' and 'pollution_level' columns


## `groupby()`

Grouping data and calculating aggregate statistics can be done using the `groupby()` method. Let's work with a longer version of our dataframe. 
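
Before building the longer dataframe, here is a minimal sketch of the split-then-summarize pattern on a tiny made-up dataframe (the values are illustrative only):

```python
import pandas as pd

# Tiny made-up dataframe with two observations per ecosystem
df = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Forest', 'Desert'],
    'species_richness': [120, 45, 110, 50]
})

# Split the rows by ecosystem, then average species_richness within each group
mean_richness = df.groupby('ecosystem')['species_richness'].mean()

# The group labels become the index; reset_index() turns them back into a column
mean_richness_df = mean_richness.reset_index()
```

The result of the first step is a Series indexed by `ecosystem`. Calling `reset_index()` is optional, but it gives back an ordinary dataframe that is easier to rename or plot.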

In [None]:
# Build long dataframe
df_env_long = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban', 'Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30, 110, 50, 85, 65, 35],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High', 'Low', 'High', 'Medium', 'Low', 'High']
})

# Group by 'ecosystem' and calculate the mean species richness


Let's rename the columns so it is clear what each one contains.
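
One way to do this is the `rename()` method. The sketch below assumes a small summary table typed in by hand, and the new column name `mean_species_richness` is just a suggestion:

```python
import pandas as pd

# Hand-typed stand-in for a grouped summary table
summary = pd.DataFrame({
    'ecosystem': ['Desert', 'Forest'],
    'species_richness': [47.5, 115.0]
})

# rename() takes a mapping of old name -> new name and returns a new DataFrame
summary_renamed = summary.rename(
    columns={'species_richness': 'mean_species_richness'}
)
```

Columns that are not listed in the mapping keep their original names.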

In [None]:
# rename columns


## `agg()`

The `agg()` (aggregate) method can be used to apply multiple aggregation functions to grouped data.
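
Here is a minimal sketch of two ways to call `agg()` on grouped data, using a tiny made-up dataframe. The output column names `mean_richness` and `total_richness` are our own choices:

```python
import pandas as pd

# Tiny made-up dataframe with two observations per ecosystem
df = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Forest', 'Desert'],
    'species_richness': [120, 45, 110, 50]
})

# Pass a list of function names: one output column per function
summary = df.groupby('ecosystem')['species_richness'].agg(['mean', 'sum'])

# Named aggregation: new_column=(source_column, function)
named = df.groupby('ecosystem').agg(
    mean_richness=('species_richness', 'mean'),
    total_richness=('species_richness', 'sum'),
)
```

The second form takes a little more typing, but it lets you choose clear column names at the same time as you aggregate.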

In [None]:
# Group by 'ecosystem' and calculate the mean and total species richness


## `assign()`

The `assign()` method can be used to add new columns or modify existing ones. It returns a new dataframe rather than changing the original in place. You can use it with a lambda function, which is a small anonymous function defined with the `lambda` keyword. The expression after the `:` is evaluated and its result is returned. Inside `assign()`, the lambda receives the dataframe as its argument, and the returned result is what gets assigned to the new or modified column.
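
As a minimal sketch, assuming a small made-up dataframe and a new column name of our own choosing (`species_richness_plus10`):

```python
import pandas as pd

# Small made-up dataframe
df = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland'],
    'species_richness': [120, 45, 80]
})

# The lambda receives the dataframe (called d here) and returns the new column
df_plus = df.assign(
    species_richness_plus10=lambda d: d['species_richness'] + 10
)
```

Using an existing column name on the left of the `=` would overwrite that column in the returned dataframe instead of adding a new one.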

In [None]:
# Add 10 to the 'species_richness' column


## `np.where()`

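The `np.where()` function returns one of two values for each element, depending on whether a condition is true. This makes it useful for conditionally creating or modifying columns in a dataframe.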
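
Here is a minimal sketch on a small made-up dataframe. The threshold of 70 and the column name `richness_category` are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Small made-up dataframe
df = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland'],
    'species_richness': [120, 45, 80]
})

threshold = 70  # arbitrary cutoff, chosen only for this example

# np.where(condition, value_if_true, value_if_false), applied row by row
df['richness_category'] = np.where(
    df['species_richness'] > threshold, 'High', 'Low'
)
```

The condition is a boolean Series, so every row is checked at once without writing a loop.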

In [None]:
# Categorize species_richness as 'High' or 'Low' based on a threshold value


## Filtering

Filtering rows in a dataframe can be done using boolean indexing.
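
As a minimal sketch of boolean indexing on a small made-up dataframe (the cutoff of 100 in the second filter is arbitrary):

```python
import pandas as pd

# Small made-up dataframe
df = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Grassland'],
    'species_richness': [120, 45, 60],
    'pollution_level': ['Low', 'High', 'Low']
})

# Step 1: a comparison produces a True/False value for every row
is_low = df['pollution_level'] == 'Low'

# Step 2: indexing with that boolean Series keeps only the True rows
low_pollution = df[is_low]

# Combine conditions with & (and) or | (or), wrapping each one in parentheses
low_and_rich = df[(df['pollution_level'] == 'Low') & (df['species_richness'] > 100)]
```

The parentheses around each condition matter, because `&` binds more tightly than `==` and `>` in Python.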

In [None]:
# Filter rows where pollution_level is 'Low'


## `dropna()`

Dropping rows with missing values can be done using the `dropna()` method.
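
Here is a minimal sketch on a small made-up dataframe with one missing value:

```python
import numpy as np
import pandas as pd

# Small made-up dataframe with a missing ecosystem in the second row
df = pd.DataFrame({
    'ecosystem': ['Forest', np.nan, 'Urban'],
    'species_richness': [120, 60, 30]
})

# Drop every row that has a missing value in any column
df_complete = df.dropna()

# Only consider selected columns when deciding which rows to drop
df_subset = df.dropna(subset=['species_richness'])
```

Like most pandas methods, `dropna()` returns a new dataframe, so `df` itself still has all three rows afterwards.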

In [None]:
# Build a dataset with a missing value (NaN)
df_env_na = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', np.nan, 'Urban'],
    'species_richness': [120, 45, 80, 60, 30],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High']
})

# Drop rows with missing values


# Reading in a Dataset and Gathering Basic Information

Next, let's read in a CSV file and gather basic information about the dataset.
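
As a minimal sketch of `pd.read_csv()`, the example below reads CSV text from an in-memory string so that it runs without any file on disk. The contents and the column names `manufacturer` and `hwy` are made up for illustration; with a real file you would pass its path as the first argument instead.

```python
import io

import pandas as pd

# Made-up CSV contents standing in for a file on disk
csv_text = """manufacturer,hwy,fl
ford,26,r
toyota,31,r
dodge,17,e
"""

# read_csv accepts a file path or any file-like object
df_cars = pd.read_csv(io.StringIO(csv_text))

# Show the first rows to check that the data loaded as expected
print(df_cars.head())
```

The first line of the CSV is used as the column names by default.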

In [None]:
# Read in the CSV file

# Display the first few rows of the dataframe


## Basic Information about the DataFrame

Here are some good ways to get basic information about a dataframe in Python:

- `head()`: Displays the first few rows of the dataframe.
- `tail()`: Displays the last few rows of the dataframe.
- `shape`: Returns the dimensions of the dataframe (number of rows and columns).
- `columns`: Returns the column names of the dataframe.
- `info()`: Displays the structure of the dataframe, including column names, non-null counts, data types, and memory usage.
- `describe()`: Provides summary statistics for the numeric columns by default (pass `include='all'` to cover every column).
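
The calls listed above can be sketched on a small made-up dataframe as follows. In a notebook, the last expression in a cell is displayed automatically, so `print()` is only used here to show several results from one block:

```python
import pandas as pd

# Small made-up dataframe
df = pd.DataFrame({
    'ecosystem': ['Forest', 'Desert', 'Wetland', 'Grassland', 'Urban'],
    'species_richness': [120, 45, 80, 60, 30],
    'pollution_level': ['Low', 'High', 'Medium', 'Low', 'High']
})

print(df.head(2))     # first 2 rows (the default is 5)
print(df.tail(2))     # last 2 rows
print(df.shape)       # (rows, columns); an attribute, so no parentheses
print(df.columns)     # column names; also an attribute
df.info()             # prints its own report, so print() is not needed
print(df.describe())  # summary statistics for the numeric column
```

Note that `shape` and `columns` are attributes, while `head()`, `tail()`, `info()`, and `describe()` are methods and need parentheses.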

In [None]:
# Display the first few rows of the dataframe


In [None]:
# Get the dimensions of the dataframe


In [None]:
# Get the column names of the dataframe


In [None]:
# Display the structure of the dataframe


In [None]:
# Provide summary statistics for each column in the dataframe


# Basic Data Cleaning
Let's perform a few quick data cleaning operations with `pandas`. Remember from our R lectures that data cleaning should be informed by your research question.
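
As a minimal sketch of two common cleaning steps, keeping rows that match a list of values and dropping a column, here is a small made-up dataframe. The column names `manufacturer` and `hwy` and all values are assumptions for illustration:

```python
import pandas as pd

# Small made-up dataframe standing in for the real dataset
df_cars = pd.DataFrame({
    'manufacturer': ['ford', 'audi', 'toyota', 'dodge'],
    'hwy': [26, 29, 31, 17],
    'fl': ['r', 'p', 'r', 'e']
})

# isin() is True for rows whose value appears in the list
keep = ['ford', 'dodge', 'toyota']
df_filtered = df_cars[df_cars['manufacturer'].isin(keep)]

# drop(columns=...) returns a copy without the named columns
df_clean = df_filtered.drop(columns=['fl'])
```

Matching with `isin()` is case-sensitive, so it is worth checking how the values are spelled in your data before filtering.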

In [None]:
# Clean data. Our research question is:
#   - only interested in Ford, Dodge, and Toyota
#   - not about the type of fuel ('fl')

# Filter the dataframe to include only Ford, Dodge, and Toyota vehicles

# Drop the 'fl' variable from the dataframe

# get info


# Basic Data Visualization

Let's create some basic plots to visualize the data.

## Histogram

In [None]:
# Histogram


## Box plot

In [None]:
# Box plot


## Bar chart

In [None]:
# Bar chart


## Scatter plot

In [None]:
# Scatter plot


## Scatter plot with color

In [None]:
# Scatter plot with color grouping


## Facet plot
Faceting is more advanced in Python than in R, but the code is included so you all have it for reference!
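
One possible approach, sketched here with seaborn's `relplot()` on a made-up dataframe, is to pass a grouping column to `col` so that each group gets its own panel. The column names are assumptions for illustration:

```python
import pandas as pd
import seaborn as sns

# Small made-up dataframe
df_cars = pd.DataFrame({
    'manufacturer': ['ford', 'ford', 'toyota', 'toyota', 'dodge', 'dodge'],
    'displ': [4.6, 5.4, 1.8, 2.7, 3.3, 4.7],
    'hwy': [21, 17, 33, 24, 22, 17]
})

# relplot returns a FacetGrid with one scatter panel per manufacturer
g = sns.relplot(
    data=df_cars, x='displ', y='hwy',
    col='manufacturer', kind='scatter', height=3
)
```

This plays a similar role to `facet_wrap()` in ggplot2. Adding `col_wrap=2` would wrap the panels onto multiple rows when there are many groups.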

In [None]:
# Facet plot
