# Activity: Removing Data

## Introduction

In this activity, you will practice using pandas to find and remove unwanted data from a dataset.
The activity covers the following topics:
- Removing columns from a DataFrame
- Removing rows from a DataFrame
- Removing rows based on a condition
- Checking for duplicate data


In [1]:
import pandas as pd

#### Question 1

Create a `DataFrame` called `df` from the given CSV file `exotic_plants_data.csv`, then drop the column `Type` and assign the result to a new `DataFrame` called `df_no_type`.
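
As a warm-up, here is a minimal sketch of dropping a column, run on a small made-up DataFrame. The name `toy` and its values are hypothetical and are not taken from `exotic_plants_data.csv`. It shows that `drop` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy data, not taken from exotic_plants_data.csv
toy = pd.DataFrame({
    "Plant Name": ["Orchid", "Fern"],
    "Type": ["Ornamental", "Ground Cover"],
    "Height (cm)": [30, 40],
})

# drop() returns a new DataFrame; `toy` itself is not modified
toy_no_type = toy.drop(columns=["Type"])

print(list(toy_no_type.columns))  # ['Plant Name', 'Height (cm)']
print("Type" in toy.columns)      # True
```

Passing `columns=["Type"]` is equivalent to passing `["Type"]` together with `axis=1`.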


In [2]:
# Your code here
df = pd.read_csv("exotic_plants_data.csv")

df_no_type = df.drop(columns=["Type"])

In [3]:
df_no_type

Unnamed: 0,Plant Name,Origin,Height (cm)
0,Orchid,Tropical,30
1,Fern,Tropical,40
2,Bamboo,Asia,600
3,Cactus,America,60
4,Bird of Paradise,Africa,150
...,...,...,...
71,Ficus,Asia,200
72,Columbine,North America,30
73,Jasmine,Asia,90
74,Fuchsia,Central and South America,40


In [4]:
# Question 1 Grading Checks

assert isinstance(df, pd.DataFrame), 'Have you created a DataFrame named df?'
assert isinstance(df_no_type, pd.DataFrame), 'Have you created a DataFrame named df_no_type?'


#### Question 2

Remove the rows with index labels 57 and 61 from the `df` DataFrame and assign the result to a new `DataFrame` called `df_dropped_indices`.
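
The sketch below illustrates row removal by index label on a hypothetical four-row DataFrame (`toy` is an invented name). It also shows that the remaining labels are not renumbered unless you reset the index yourself.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"Plant Name": ["Orchid", "Fern", "Bamboo", "Cactus"]})

# With the default axis=0, drop() removes rows by index label
toy_dropped = toy.drop([1, 3])
print(toy_dropped.index.tolist())  # [0, 2]

# The surviving labels keep their original values;
# reset_index(drop=True) renumbers them from 0 if that is needed
print(toy_dropped.reset_index(drop=True).index.tolist())  # [0, 1]
```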


In [5]:
# Your code here

df_dropped_indices = df.drop([57, 61])

df_dropped_indices

Unnamed: 0,Plant Name,Type,Origin,Height (cm)
0,Orchid,Ornamental,Tropical,30
1,Fern,Ground Cover,Tropical,40
2,Bamboo,Grass,Asia,600
3,Cactus,Succulent,America,60
4,Bird of Paradise,Ornamental,Africa,150
...,...,...,...,...
71,Ficus,Tree,Asia,200
72,Columbine,Flower,North America,30
73,Jasmine,Shrub,Asia,90
74,Fuchsia,Flower,Central and South America,40


In [6]:
# Question 2 Grading Checks

assert isinstance(df_dropped_indices, pd.DataFrame), 'Have you created a DataFrame named df_dropped_indices?'


#### Question 3

Remove rows where the `Origin` column is equal to `Africa` from the `df` DataFrame and store the result in a new `DataFrame` called `df_no_african_plants`.
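
Conditional removal is usually done with a boolean mask rather than with `drop`. The following sketch uses invented data (`toy` and `keep` are hypothetical names) to show the mask and the filtered result side by side.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Plant Name": ["Orchid", "Bird of Paradise", "Bamboo"],
    "Origin": ["Tropical", "Africa", "Asia"],
})

# Comparing a column with a value yields a boolean Series, one entry per row
keep = toy["Origin"] != "Africa"
print(keep.tolist())  # [True, False, True]

# Indexing with the mask keeps only the rows marked True
toy_filtered = toy[keep]
print(toy_filtered["Plant Name"].tolist())  # ['Orchid', 'Bamboo']
```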


In [7]:
# Your code here

df_no_african_plants = df[df['Origin'] != 'Africa']

In [8]:
# Question 3 Grading Checks

assert isinstance(df_no_african_plants, pd.DataFrame), 'Have you created a DataFrame named df_no_african_plants?'


#### Question 4

Find any duplicate rows in the `df` `DataFrame` and assign those rows to a new `DataFrame` called `df_duplicates`.
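
For reference, `duplicated()` returns a boolean Series that marks each row repeating an earlier row in full. The sketch below uses a hypothetical three-row DataFrame (`toy` and `flags` are invented names) in which the last row repeats the first.

```python
import pandas as pd

# Hypothetical toy data: row 2 is an exact copy of row 0
toy = pd.DataFrame({
    "Plant Name": ["Cactus", "Rose", "Cactus"],
    "Height (cm)": [60, 60, 60],
})

# By default the first occurrence is not flagged, only later repeats
flags = toy.duplicated()
print(flags.tolist())  # [False, False, True]

# Indexing with the flags selects the duplicate rows themselves
print(toy[flags].index.tolist())  # [2]
```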


In [9]:
# Your code here

df_duplicates = df[df.duplicated()]

df_duplicates

Unnamed: 0,Plant Name,Type,Origin,Height (cm)
6,Cactus,Succulent,America,60
30,Rafflesia,Flower,Southeast Asia,20
47,Kangaroo Paw,Flower,Australia,60
48,Bougainvillea,Shrub,South America,400
49,Bird of Paradise,Ornamental,Africa,150
50,Venus Flytrap,Carnivorous,North America,15
51,Rose,Flower,Asia,60


In [10]:
# Question 4 Grading Checks

assert isinstance(df_duplicates, pd.DataFrame), 'Have you created a DataFrame named df_duplicates?'


#### Question 5

Find any rows in the `df` `DataFrame` that are duplicates with respect to the `Plant Name` and `Type` columns only, and assign those rows to a new `DataFrame` called `df_plant_type_duplicates`.
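
The `subset` argument limits the comparison to the listed columns. In the sketch below, built on invented data (`toy` is a hypothetical name), two rows share a name and type but differ in height, so they count as duplicates only when `subset` is given.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 differ only in height
toy = pd.DataFrame({
    "Plant Name": ["Bamboo", "Bamboo", "Rose"],
    "Type": ["Grass", "Grass", "Flower"],
    "Height (cm)": [600, 500, 60],
})

# Comparing all columns: no row is a full duplicate
print(toy.duplicated().tolist())  # [False, False, False]

# Comparing only Plant Name and Type: row 1 repeats row 0
print(toy.duplicated(subset=["Plant Name", "Type"]).tolist())  # [False, True, False]
```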


In [11]:
# Your code here

df_plant_type_duplicates = df[df.duplicated(subset=['Plant Name', 'Type'])]

df_plant_type_duplicates

Unnamed: 0,Plant Name,Type,Origin,Height (cm)
6,Cactus,Succulent,America,60
22,Bamboo,Grass,Asia,500
30,Rafflesia,Flower,Southeast Asia,20
47,Kangaroo Paw,Flower,Australia,60
48,Bougainvillea,Shrub,South America,400
49,Bird of Paradise,Ornamental,Africa,150
50,Venus Flytrap,Carnivorous,North America,15
51,Rose,Flower,Asia,60
53,Tulip,Flower,Europe,30
55,Sunflower,Flower,North America,180


In [12]:
# Question 5 Grading Checks

assert isinstance(df_plant_type_duplicates, pd.DataFrame), 'Have you created a DataFrame named df_plant_type_duplicates?'


#### Question 6

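Create a boolean mask called `clean_mask` that flags duplicate rows in the `df` DataFrame, where rows count as duplicates if they share the same `Plant Name` and `Origin`. The mask should leave only the most up-to-date entry of each duplicate group unflagged, assuming that later rows are more recent.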
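
The `keep` argument controls which member of a duplicate group is left unflagged. The sketch below uses invented data (`toy`, `mask`, and `cleaned` are hypothetical names) and assumes that later rows are the more recent ones, which the dataset itself does not state.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share Plant Name and Origin
toy = pd.DataFrame({
    "Plant Name": ["Bamboo", "Rose", "Bamboo"],
    "Origin": ["Asia", "Asia", "Asia"],
    "Height (cm)": [600, 60, 500],
})

# keep="last" flags every duplicate except the last occurrence
mask = toy.duplicated(subset=["Plant Name", "Origin"], keep="last")
print(mask.tolist())  # [True, False, False]

# Inverting the mask keeps the unflagged rows
cleaned = toy[~mask]
print(cleaned["Height (cm)"].tolist())  # [60, 500]
```

On this toy data, `toy.drop_duplicates(subset=["Plant Name", "Origin"], keep="last")` yields the same rows in one step; building the mask explicitly is useful when you want to inspect what will be removed first.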


In [13]:
# Your code here

clean_mask = df.duplicated(subset=['Plant Name', 'Origin'], keep='last')

df_cleaned = df[~clean_mask]

In [14]:
# Question 6 Grading Checks

assert isinstance(clean_mask, pd.Series), 'Have you created a Series named clean_mask?'
