# Project 1: Data Analysis and Visualization with a Real-Life Marine Litter Dataset

Now, let's combine everything we have learned so far to work on a real-life dataset.

This dataset describes marine litter found on the seafloor of the southeastern North Sea. It was collected during scientific research aimed at understanding the distribution and impact of marine litter in this area. Here's a breakdown of the dataset description and its columns:

- **Citation:** The dataset is attributed to Lars Gutow and published by the Alfred Wegener Institute. It is part of a larger study that investigates marine litter in the North Sea. The dataset can be accessed through its DOI link (https://doi.org/10.1594/PANGAEA.890785).

- **Location:** The dataset is specific to the North Sea.

- **Campaign:** The specific research campaign under which this data was collected was "HE419".

- **Method/Device:** A beam trawl (BEAM) was used to collect marine litter data from the seafloor.

## Dataset Columns:
- **Event:** The label of the sampling event. It appears as the first column of the file.
- **Station:** This identifies the specific sampling station where the data was collected.
- **Date/Time:** The exact date and time of the sampling at the station.
- **Latitude:** The geographic latitude of the sampling point.
- **Longitude:** The geographic longitude of the sampling point.
- **Elevation [m]:** The depth at which the sampling occurred, measured in meters. Depths are stored as negative values (for example, -18).
- **Litter obj:** The specific litter object found during sampling (for example, plastic fragment or glass bottle).
- **Litter cat:** The category of the litter object (for example, plastics, metal, or glass).
- **Litter fish:** A yes/no flag. In this project we read it as whether fish were present alongside the litter.

This dataset is crucial for understanding marine pollution in the North Sea. It documents the types and quantities of litter found at various depths and locations, which can help researchers assess the impact of marine litter on marine ecosystems and contribute to environmental management efforts. The detailed metadata allows for effective spatial and temporal analyses, enabling scientists to track changes over time and identify potential sources of pollution.

Start by **importing the libraries** you will need for the analysis. 

You will use Pandas for data manipulation, NumPy for numerical operations, and Matplotlib and Seaborn for your visualizations:

In [10]:
#Your code goes here

Now that you have the libraries ready, **load the dataset** (*marine_seafloor_litter.csv*) from the CSV file. 

Display the first few rows and the dataframe's shape, so you can get an idea of what the data looks like:

In [1]:
# Loading the Dataset
# Let's start by loading our dataset.
#Your code goes here

# Display the first few rows of the dataset
#Your code goes here
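
One possible way to approach this step is sketched below. So that it runs on its own, a small in-memory CSV stands in for *marine_seafloor_litter.csv*; the `sample_csv` name and the Event, Latitude and Longitude values are invented for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for marine_seafloor_litter.csv.
# The Event, Latitude and Longitude values are invented for illustration.
sample_csv = io.StringIO(
    "Event,Station,Date/Time,Latitude,Longitude,Elevation [m],Litter obj,Litter cat,Litter fish\n"
    "EV-1,H1,2014-04-03 7:18:00,54.10,7.90,-18,plastic fragment,plastics,no\n"
    "EV-1,H1,2014-04-03 7:18:00,54.10,7.90,-18,glass bottle,glass,no\n"
    "EV-2,M03,2014-04-03 18:17:00,54.50,7.50,-32,net fiber,plastics,yes\n"
)

# With the real file, pass its path instead: pd.read_csv("marine_seafloor_litter.csv")
df = pd.read_csv(sample_csv)

print(df.head())           # first rows of the dataframe
print("Shape:", df.shape)  # (number of rows, number of columns)
```

With the real dataset, only the argument of `pd.read_csv()` changes; `head()` and `shape` are used the same way.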

Next, check for any **missing values** in the dataframe.

**Hint:** You can use the ***isnull()*** method along with ***sum()*** to calculate the total number of missing values in each column.

In [12]:
#Handling Missing Values
# Let's check for any missing values in our dataset.
#Your code goes here


Missing Values in Dataset:
Event             0
Station          26
Date/Time         0
Latitude          0
Longitude         0
Elevation [m]     0
Litter obj        0
Litter cat        0
Litter fish       0
dtype: int64


Now, you can see how many missing values are in each column.
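
As a minimal sketch of the missing-value check, the example below uses a hypothetical four-row dataframe in which two station names are missing. The sample values are assumptions; only the `isnull().sum()` pattern carries over to the real data.

```python
import pandas as pd

# Hypothetical sample; None marks a missing station name.
df = pd.DataFrame({
    "Station": ["H1", None, "M03", None],
    "Litter obj": ["plastic fragment", "glass bottle", "net fiber", "net fiber"],
    "Litter cat": ["plastics", "glass", "plastics", "plastics"],
})

# isnull() flags each missing cell as True; sum() counts the True values per column.
missing_per_column = df.isnull().sum()

print("Missing Values in Dataset:")
print(missing_per_column)
```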

**Handle the missing values** by using the **fillna()** function and check the missing values in the dataframe again:

In [13]:
#Your code goes here


Missing Values in Dataset:
Event            0
Station          0
Date/Time        0
Latitude         0
Longitude        0
Elevation [m]    0
Litter obj       0
Litter cat       0
Litter fish      0
dtype: int64
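
A possible way to fill the gaps is sketched here on a hypothetical sample. The placeholder label `"Unknown"` is an assumption; the notebook does not prescribe a fill value, so pick whatever suits your analysis.

```python
import pandas as pd

# Hypothetical sample; None marks a missing station name.
df = pd.DataFrame({
    "Station": ["H1", None, "M03", None],
    "Litter obj": ["plastic fragment", "glass bottle", "net fiber", "net fiber"],
})

# "Unknown" is an assumed placeholder for stations without a recorded name.
df["Station"] = df["Station"].fillna("Unknown")

# Re-run the check to confirm that no missing values remain.
print("Missing Values in Dataset:")
print(df.isnull().sum())
```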


Now, **check for duplicate entries** in your dataset. If you find any, remove them to ensure the analysis is based on unique observations.

Print the number of duplicated rows, and display the dataframe’s shape before and after removing duplicates to verify the operation was successful.

**Hint:** Combine the duplicate-detection method we learned in class with **sum()** to count the total number of duplicated rows.

In [2]:
# Handling Duplicates
# We should also check for any duplicate entries in our dataset.
#Your code goes here

# Dropping duplicates from our dataset
#Your code goes here
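
The sketch below shows one way to count and drop duplicates, assuming a small hypothetical dataframe in which the last row repeats the third one.

```python
import pandas as pd

# Hypothetical sample: the last row is an exact copy of the third row.
df = pd.DataFrame({
    "Station": ["H1", "H1", "M03", "M03"],
    "Litter obj": ["plastic fragment", "glass bottle", "net fiber", "net fiber"],
    "Litter cat": ["plastics", "glass", "plastics", "plastics"],
})

# duplicated() marks every repeat of an earlier row as True; sum() counts them.
num_duplicates = df.duplicated().sum()
print("Duplicated rows:", num_duplicates)

shape_before = df.shape
df = df.drop_duplicates()  # keeps the first occurrence of each row

print("Shape before:", shape_before)
print("Shape after:", df.shape)
```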

Now, **handle the outliers** in the dataframe with the *Interquartile Range (IQR)* method.

Check the dataframe's shape after the outliers were removed. 

In [3]:
#Your code goes here
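
A minimal sketch of the IQR method follows. It assumes that the outliers of interest are in the 'Elevation [m]' column and uses the common 1.5 × IQR fences; both choices are assumptions, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one unusually deep record (-120 m).
df = pd.DataFrame({
    "Station": [f"S{i}" for i in range(8)],
    "Elevation [m]": [-18, -20, -22, -25, -30, -32, -34, -120],
})

# Quartiles and interquartile range of the column.
q1 = df["Elevation [m]"].quantile(0.25)
q3 = df["Elevation [m]"].quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
df_no_outliers = df[df["Elevation [m]"].between(lower, upper)]

print("Shape before:", df.shape)
print("Shape after:", df_no_outliers.shape)
```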

Next, **calculate how many times fish were present alongside litter and how many times they were not.**

To do this, examine each value in the 'Litter fish' column. Use a condition to check whether the value is 'yes' or 'no', and count the occurrences of each.

In [5]:
#Calculating Statistics
# Now, we want to count the number of times fish are present in relation to the litter.

#Your code goes here
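
One way to count both cases is sketched below on a hypothetical five-row sample. Comparing the column against a value yields a boolean Series, and summing it counts the matches.

```python
import pandas as pd

# Hypothetical sample of the 'Litter fish' column.
df = pd.DataFrame({
    "Litter obj": ["plastic fragment", "metal texture", "glass bottle", "net fiber", "net fiber"],
    "Litter fish": ["no", "no", "no", "yes", "yes"],
})

# True counts as 1 and False as 0, so sum() gives the number of matches.
fish_present = (df["Litter fish"] == "yes").sum()
fish_absent = (df["Litter fish"] == "no").sum()

print("Fish present:", fish_present)
print("Fish not present:", fish_absent)
```

`df["Litter fish"].value_counts()` would return both counts in a single call.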

Now, **filter** the dataset to look specifically at **plastic litter** since it's a significant issue in marine environments.

Create a new DataFrame that contains only plastic litter and calculate the percentage of plastic litter in the dataset.

In [6]:
# Subsetting Data using Criteria
# Let's filter the data to focus on plastic litter, as it's one of the most common types found in the ocean.
#Your code goes here

# Calculate the percentage of plastic litter in relation to the total data
#Your code goes here

# Print the percentage of plastic litter
#Your code goes here
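
The filter and the percentage can be sketched as follows. The sample rows are hypothetical, and the sketch assumes plastic litter is labeled `'plastics'` in the 'Litter cat' column.

```python
import pandas as pd

# Hypothetical sample with three plastic records out of five.
df = pd.DataFrame({
    "Litter obj": ["plastic fragment", "metal texture", "glass bottle", "net fiber", "net fiber"],
    "Litter cat": ["plastics", "metal", "glass", "plastics", "plastics"],
})

# Boolean mask: keep only the rows whose category is 'plastics'.
plastic_df = df[df["Litter cat"] == "plastics"]

# Share of plastic records relative to all records, in percent.
plastic_percentage = 100 * len(plastic_df) / len(df)

print(f"Percentage of plastic litter: {plastic_percentage:.1f}%")
```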

Next, **slice your DataFrame to focus on the columns that matter for our analysis**. 

Keep the *Station*, *Date/Time*, *Elevation [m]*, *Litter obj*, *Litter cat*, and *Litter fish* columns.

In [18]:
#Slicing Rows and Columns
#Now that we have our dataset loaded, 
#let's extract the columns that are relevant to our analysis.

#Your code goes here

#Print the first 5 rows of your dataframe
#Your code goes here


Sliced Data:
  Station            Date/Time  Elevation [m]        Litter obj Litter cat  \
0      H1   2014-04-03 7:18:00            -18  plastic fragment   plastics   
1      H1   2014-04-03 7:18:00            -18     metal texture      metal   
2      H1   2014-04-03 7:18:00            -18      glass bottle      glass   
3     M03  2014-04-03 18:17:00            -32         net fiber   plastics   
4     M05  2014-04-03 23:14:00            -34         net fiber   plastics   

  Litter fish  
0          no  
1          no  
2          no  
3         yes  
4         yes  
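
Selecting columns by name could look like the sketch below. The sample dataframe is hypothetical (the Event, Latitude and Longitude values are invented), and `columns_to_keep` and `sliced_df` are illustrative names.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset.
df = pd.DataFrame({
    "Event": ["EV-1", "EV-1", "EV-2"],
    "Station": ["H1", "H1", "M03"],
    "Date/Time": ["2014-04-03 7:18:00", "2014-04-03 7:18:00", "2014-04-03 18:17:00"],
    "Latitude": [54.10, 54.10, 54.50],
    "Longitude": [7.90, 7.90, 7.50],
    "Elevation [m]": [-18, -18, -32],
    "Litter obj": ["plastic fragment", "glass bottle", "net fiber"],
    "Litter cat": ["plastics", "glass", "plastics"],
    "Litter fish": ["no", "no", "yes"],
})

columns_to_keep = ["Station", "Date/Time", "Elevation [m]",
                   "Litter obj", "Litter cat", "Litter fish"]

# A list of column names inside [] selects those columns; copy() avoids
# later modifications being applied to a view of the original dataframe.
sliced_df = df[columns_to_keep].copy()

print("Sliced Data:")
print(sliced_df.head())
```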


**Visualize how often fish were present alongside each type of litter.** Create a bar plot showing the number of times fish were present or not with each litter category.

In [7]:
#Your code goes here
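
A possible sketch of this plot uses `pd.crosstab()` to build the count table and pandas' built-in plotting (which relies on Matplotlib) to draw grouped bars. The sample data is hypothetical, and Seaborn's count plot would be an equally valid choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of litter categories and the 'Litter fish' flag.
df = pd.DataFrame({
    "Litter cat": ["plastics", "metal", "glass", "plastics", "plastics"],
    "Litter fish": ["no", "no", "no", "yes", "yes"],
})

# Rows are litter categories, columns are 'Litter fish' values, cells are record counts.
fish_by_category = pd.crosstab(df["Litter cat"], df["Litter fish"])
print(fish_by_category)

# One group of bars per category, one bar per 'Litter fish' value.
ax = fish_by_category.plot(kind="bar")
ax.set_xlabel("Litter category")
ax.set_ylabel("Number of records")
ax.set_title("Fish presence by litter category")
plt.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script you would call `plt.show()`.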

Now, **analyze how the amount of litter collected changes over time** by visualizing litter counts against the sampling date.

Convert the 'Date/Time' column to a proper datetime format using pd.to_datetime().

Extract the year and month from the 'Date/Time' column to group and count litter occurrences over time.

In [8]:
# Time Series Analysis of Litter Collection
#Your code goes here

# Extract Year and Month and count litter occurrences by Year-Month
#Your code goes here

# Create a time series plot
#Your code goes here

This line plot displays the number of litter objects collected over time. By visualizing litter collection on a timeline, we can identify trends or spikes in litter accumulation, which may correspond to specific events or seasonal patterns, providing insights for further research or action.
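
The time series steps can be sketched as follows, grouping by year-month as the cell comments describe. The dates are hypothetical, and `'Year-Month'` and `litter_per_month` are illustrative names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample spanning two months.
df = pd.DataFrame({
    "Date/Time": ["2014-04-03 07:18:00", "2014-04-03 18:17:00", "2014-04-03 23:14:00",
                  "2014-05-10 09:00:00", "2014-05-11 10:30:00"],
    "Litter obj": ["plastic fragment", "net fiber", "net fiber", "glass bottle", "net fiber"],
})

# Convert the text column to datetime values.
df["Date/Time"] = pd.to_datetime(df["Date/Time"])

# Reduce each timestamp to its year-month period and count the records per period.
df["Year-Month"] = df["Date/Time"].dt.to_period("M")
litter_per_month = df.groupby("Year-Month").size()
print(litter_per_month)

# Line plot of the monthly counts; the periods are shown as text labels.
fig, ax = plt.subplots()
ax.plot(litter_per_month.index.astype(str), litter_per_month.values, marker="o")
ax.set_xlabel("Year-Month")
ax.set_ylabel("Number of litter objects")
ax.set_title("Litter objects collected over time")
fig.tight_layout()
```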

Now **examine the relationship between litter objects and fish populations**. 

Create a count plot to visualize how often each type of litter object was found in samples where fish were present.

Start by filtering the dataset to include only those entries where Litter fish is marked as 'yes'.

In [9]:
# Count Plot of Litter Objects Found with Fish
# Filter the dataset for entries where 'Litter fish' is 'yes'
#Your code goes here
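
One way to sketch this step is to filter first and then count each object name. The sample is hypothetical, and the bar chart is drawn with pandas' built-in plotting rather than Seaborn's count plot, which would work as well.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; note the two spellings 'net fiber' and 'net fibers'.
df = pd.DataFrame({
    "Litter obj": ["net fiber", "net fibers", "net fiber", "plastic fragment", "glass bottle"],
    "Litter fish": ["yes", "yes", "yes", "no", "no"],
})

# Keep only the records where fish were present.
fish_df = df[df["Litter fish"] == "yes"]

# Number of records per litter object among those rows.
obj_counts = fish_df["Litter obj"].value_counts()
print(obj_counts)

ax = obj_counts.plot(kind="bar")
ax.set_xlabel("Litter object")
ax.set_ylabel("Count")
ax.set_title("Litter objects found with fish")
plt.tight_layout()
```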

This graph shows that several litter objects have similar but not identical names (for example, 'net fiber' and 'net fibers').

Before standardizing the data, verify that each litter type corresponds to a single litter category.

Filter the dataset to include only rows where fish were present, then group the data by **'Litter obj'** and display the unique **'Litter cat'** values within each group.

***Hint:*** Use **groupby()**

In [10]:
#Your code goes here
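
The check can be sketched with `groupby()` followed by `unique()`, here on a hypothetical sample. If every group lists a single category, each object name maps to exactly one category.

```python
import pandas as pd

# Hypothetical sample with two spellings of the same object.
df = pd.DataFrame({
    "Litter obj": ["net fiber", "net fibers", "net fiber", "glass bottle"],
    "Litter cat": ["plastics", "plastics", "plastics", "glass"],
    "Litter fish": ["yes", "yes", "yes", "no"],
})

# Keep only the records where fish were present.
fish_df = df[df["Litter fish"] == "yes"]

# For each litter object, list the distinct categories it appears with.
categories_per_object = fish_df.groupby("Litter obj")["Litter cat"].unique()
print(categories_per_object)
```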

With this grouping we now know that **they all belong to the same category, 'plastics'**, so the next step is to standardize our data.

To **standardize the values in the Litter obj column** you can use the Pandas *.replace()* method.

In [11]:
# Replace 'net fibers' with 'net fiber' in the 'Litter obj' column
#Your code goes here

# Apply the mapping to the 'Litter obj' column
#Your code goes here
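
A minimal sketch of the standardization, assuming 'net fibers' is the only variant that needs fixing. The `name_mapping` dictionary is an illustrative name; extend it if your plot reveals other spellings.

```python
import pandas as pd

# Hypothetical sample with the inconsistent spelling.
df = pd.DataFrame({
    "Litter obj": ["net fiber", "net fibers", "plastic fragment", "net fibers"],
})

# Map each variant (key) to the standard name (value).
name_mapping = {"net fibers": "net fiber"}

# replace() swaps exact matches of the keys and leaves all other values unchanged.
df["Litter obj"] = df["Litter obj"].replace(name_mapping)

print(df["Litter obj"].value_counts())
```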

Next, analyze the **relationship between elevation (depth) and litter category** with a boxplot that can show how litter types are distributed across different elevations:

In [12]:
#Your code goes here
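
The boxplot could be sketched as below with pandas' `boxplot()` method, which draws through Matplotlib. The sample values are hypothetical, and Seaborn's `sns.boxplot()` would be an equally reasonable choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of elevations (negative = below the surface) and categories.
df = pd.DataFrame({
    "Elevation [m]": [-18, -18, -18, -32, -34, -25],
    "Litter cat": ["plastics", "metal", "glass", "plastics", "plastics", "metal"],
})

# One box per litter category, showing the spread of elevations in that category.
ax = df.boxplot(column="Elevation [m]", by="Litter cat", grid=False)
ax.set_xlabel("Litter category")
ax.set_ylabel("Elevation [m]")
ax.set_title("Elevation by litter category")
ax.get_figure().suptitle("")  # remove the automatic "Boxplot grouped by ..." title
plt.tight_layout()
```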