# 1.2 Activity

In this activity, we will again use the eBird data we worked with in the previous lesson. Our focus this time will be on learning basic Python skills with NumPy and Pandas.

By the end of this activity, you will:
1. Merge two DataFrames.
2. Use Pandas .describe and .value_counts.
3. Create a new variable from an existing variable.
4. Write a conditional statement to recode an existing variable into a new variable.
5. Reflect on what the data manipulation conveyed about the data.

## Task 1: Setup Workspace

Repeat the steps from last week to mount your drive, import libraries, and read in the data file.

In [None]:
#mounting gdrive
from google.colab import drive
drive.mount('/gdrive')
#importing libraries
import gdown 
import pandas as pd
import numpy as np
#reading in the data file
gdown.download(id = '1JnVoGknl02xI-zBfCK9czx6E-OeC3Z9e')
Birds = pd.read_csv('/content/birds.csv')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


Downloading...
From: https://drive.google.com/uc?id=1JnVoGknl02xI-zBfCK9czx6E-OeC3Z9e
To: /content/birds.csv
100%|██████████| 163k/163k [00:00<00:00, 84.3MB/s]


## Task 2: Pandas

Return the first few rows of the DataFrame to ensure the data has loaded correctly.

In [None]:
Birds.head()

,Taxonomic Order,Category,Common Name,Scientific Name,Subspecies,Subspecies.1,Observation Count,Locality Type,Latitude,Longitude,Observation Date,Observation Time,Protocol Type,Duration Minutes,Effort Distance KM,Number Observers,All Species Reported
0,20762,species,American Crow,Corvus brachyrhynchos,,,1,P,32.176196,-86.352121,4/29/22,16:05:00,Traveling,84.0,1.305,1,1
1,20762,species,American Crow,Corvus brachyrhynchos,,,1,H,32.358285,-86.454432,4/26/22,17:00:00,Traveling,60.0,1.609,2,1
2,20762,species,American Crow,Corvus brachyrhynchos,,,2,P,32.310409,-86.101011,4/23/22,17:19:00,Traveling,6.0,9.74,4,1
3,20762,species,American Crow,Corvus brachyrhynchos,,,3,P,32.105727,-86.024669,4/28/22,6:17:00,Traveling,35.0,0.306,1,1
4,20762,species,American Crow,Corvus brachyrhynchos,,,1,P,32.345618,-86.03291,4/17/22,12:29:00,Stationary,61.0,,1,1


Now that you have your workspace set up, we can perform some basic functions using Pandas.

First, we might want to list all of the column names. Go to https://stackoverflow.com/questions/19482970/get-list-from-pandas-dataframe-column-headers and write the code to get a list of the columns.

In [None]:
list(Birds)
#list(Birds.columns.values) also works

['Taxonomic Order',
 'Category',
 'Common Name',
 'Scientific Name',
 'Subspecies',
 'Subspecies.1',
 'Observation Count',
 'Locality Type',
 'Latitude',
 'Longitude',
 'Observation Date',
 'Observation Time',
 'Protocol Type',
 'Duration Minutes',
 'Effort Distance KM',
 'Number Observers',
 'All Species Reported']

You may notice column(s) named "Unnamed: X". Drop any unnamed columns and re-assign the name "Birds" to the resulting DataFrame.

Use the `.drop()` method and specify `axis = 1` to reference the column(s) to drop.

If there are no columns to drop, state your reasoning and proceed to the next step.

In [None]:
#Every column name is accounted for; none of them is listed in the form "Unnamed: X".
#Columns 'Subspecies' and 'Subspecies.1' do contain NaN values, but there are no unnamed columns that need to be removed.
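
If a file did contain such columns, one way to drop them is sketched below. The `demo` DataFrame and its `Unnamed: 0` column are made up for illustration and are not part of the eBird file.

```python
import pandas as pd

# Hypothetical data: 'Unnamed: 0' mimics an index column that was saved into a CSV.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Common Name": ["American Crow", "Blue Jay"],
    "Observation Count": [1, 2],
})

# Collect every column whose name starts with "Unnamed".
unnamed_cols = [col for col in demo.columns if col.startswith("Unnamed")]

# axis=1 tells .drop() that the labels refer to columns, not rows.
demo = demo.drop(unnamed_cols, axis=1)

print(list(demo.columns))
```

Building the list of names first keeps the call safe when no such column exists, because dropping an empty list leaves the DataFrame unchanged.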

Pandas does not always read columns in as their intended data types. This can happen when a row contains invalid information or when the values can be interpreted as more than one type.

Use `.dtypes` to list the data types for all columns.


In [None]:
Birds.dtypes

Taxonomic Order           int64
Category                 object
Common Name              object
Scientific Name          object
Subspecies               object
Subspecies.1             object
Observation Count         int64
Locality Type            object
Latitude                float64
Longitude               float64
Observation Date         object
Observation Time         object
Protocol Type            object
Duration Minutes        float64
Effort Distance KM      float64
Number Observers          int64
All Species Reported      int64
dtype: object
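
As a rough illustration of how one invalid entry can change a column's data type, the sketch below reads a tiny made-up CSV. The text `unknown` and the `demo` DataFrame are assumptions for the example and do not come from the eBird file.

```python
import io
import pandas as pd

# Hypothetical CSV text: the second row holds the word "unknown" in a numeric column.
raw = io.StringIO("Duration Minutes,Number Observers\n84.0,1\nunknown,2\n6.0,4\n")
demo = pd.read_csv(raw)

# The stray text keeps pandas from treating the column as numeric.
was_numeric = pd.api.types.is_numeric_dtype(demo["Duration Minutes"])

# errors="coerce" turns values that cannot be parsed into NaN.
demo["Duration Minutes"] = pd.to_numeric(demo["Duration Minutes"], errors="coerce")

print(was_numeric)
print(demo.dtypes)
```

After the conversion the column is `float64`, and the invalid entry shows up as a missing value instead of silently blocking numeric calculations.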

Convert the `All Species Reported` column to Boolean values using the `.astype()` method. Add this data to the original DataFrame as a new column called `All Species Reported (Bool)`.

In [None]:
Birds['All Species Reported (Bool)'] = Birds['All Species Reported'].astype('bool')
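
To see what the conversion does to individual values, here is a small sketch on a made-up Series (the values are illustrative only). Zero becomes `False` and any non-zero number becomes `True`.

```python
import pandas as pd

# Hypothetical values; the 2 is included to show that any non-zero number maps to True.
flags = pd.Series([1, 0, 1, 2])
as_bool = flags.astype("bool")

print(as_bool.tolist())  # [True, False, True, True]
```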

From the first few rows of data we can see some values for the `Common Name` feature. If we want to know its unique values, we can use the `.unique()` method on the column itself.

Print a list of the unique common names found within the data.


In [None]:
Birds['Common Name'].unique()

array(['American Crow', 'American Goldfinch', 'American Kestrel',
       'American Robin', 'American White Pelican', 'Anhinga',
       'Bald Eagle', 'Baltimore Oriole', 'Barn Swallow',
       'Belted Kingfisher', 'blackbird sp.', 'Black Vulture',
       'Blue Grosbeak', 'Blue Jay', 'Brown-headed Cowbird',
       'Brown-headed Nuthatch', 'Barred Owl', 'Brown Thrasher',
       'Broad-winged Hawk', 'Black-throated Green Warbler',
       'Blue-gray Gnatcatcher', 'Blue-headed Vireo', 'Blue-winged Teal',
       'Canada Goose', 'Carolina Chickadee', 'Carolina Wren',
       'Cattle Egret', 'Cedar Waxwing', 'Chipping Sparrow',
       'Chimney Swift', "Chuck-will's-widow", 'Cliff Swallow',
       'Common Grackle', 'Common Nighthawk', 'Common Yellowthroat',
       "Cooper's Hawk", 'crow sp.', 'Double-crested Cormorant',
       'Downy Woodpecker', 'Eastern Bluebird', 'Eastern Kingbird',
       'Eastern Meadowlark', 'Eastern Phoebe', 'Eastern Towhee',
       'Eurasian Collared-Dove', 'European Star

We may also be interested in how frequently each `Locality Type` appears.

Return the counts for each category using the `.value_counts()` method.

In [None]:
Birds['Locality Type'].value_counts()

P    1069
H     330
Name: Locality Type, dtype: int64
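
If proportions are easier to read than raw counts, `.value_counts()` accepts `normalize=True`. The sketch below rebuilds a stand-in column from the counts shown above (1069 `P` and 330 `H`) rather than loading the real file.

```python
import pandas as pd

# Stand-in column with the same counts as the output above.
locality = pd.Series(["P"] * 1069 + ["H"] * 330, name="Locality Type")

# normalize=True returns each category's share of the rows instead of its count.
shares = locality.value_counts(normalize=True)

print(shares.round(3))
```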

Now, let's take a look at the frequency for each bird type using a cross tabulation of the `Common Name` and `Locality Type` features with the `pd.crosstab()` function ([Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html)).

In [None]:
pd.crosstab(Birds['Common Name'],Birds['Locality Type'], dropna=False)

Locality Type,H,P
Common Name,,
American Coot,3,1
American Crow,3,8
American Goldfinch,1,3
American Kestrel,0,1
American Robin,9,56
...,...,...
Yellow-rumped Warbler,7,15
Yellow-throated Vireo,1,1
Yellow-throated Warbler,0,1
blackbird sp.,1,0


Each row contains an `Observation Count` for the number of birds observed in a given sighting. Let's get the descriptive statistics on this column using the `.describe()` method.

In [None]:
Birds['Observation Count'].describe()

count    1399.000000
mean        2.448177
std         4.864545
min         1.000000
25%         1.000000
50%         2.000000
75%         3.000000
max       150.000000
Name: Observation Count, dtype: float64

Let's see what happens when you use the `.describe()` method on different data types. Call this method on `All Species Reported`, print the results, and then call it again on `All Species Reported (Bool)`. Compare your results: which feature is more appropriate for this task?

In [None]:
Birds['All Species Reported'].describe()

count    1399.000000
mean        0.923517
std         0.265865
min         0.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: All Species Reported, dtype: float64

In [None]:
Birds['All Species Reported (Bool)'].describe()
#The output is more meaningful for the numeric (integer) All Species Reported column
#than for the Boolean version of the variable, even though both hold essentially the same information.
#The more detailed descriptive statistics we get from the All Species Reported column help us
#understand the spread of values a little better than the True/False counts do.

count     1399
unique       2
top       True
freq      1292
Name: All Species Reported (Bool), dtype: object
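
One way to connect the two summaries: for a column that holds only 0 and 1, the mean is the share of rows equal to 1. The sketch below rebuilds a stand-in column from the counts in the output above (1292 of 1399 rows are `True`) instead of loading the file.

```python
import pandas as pd

# Stand-in column with the same counts as the output above.
reported = pd.Series([1] * 1292 + [0] * (1399 - 1292))
reported_bool = reported.astype("bool")

# Numeric summary: the mean of a 0/1 column is the fraction of ones.
share_true = reported.mean()

# Boolean summary: describe() reports the most common value and how often it occurs.
bool_summary = reported_bool.describe()

print(round(share_true, 6))
print(bool_summary["top"], bool_summary["freq"])
```

Dividing `freq` by `count` in the Boolean summary gives the same number as the mean in the numeric summary, so the two outputs describe the same fact in different forms.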

In addition to the default descriptive statistics, you can obtain others too! Return the standard error of the mean for the `Observation Count` column with `.sem()`. Round your output to two decimal places with the `round()` function.

In [None]:
round(Birds['Observation Count'].sem(),2)

0.13
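
The standard error of the mean is the sample standard deviation divided by the square root of the number of observations. As a quick check, the sketch below compares that formula with `.sem()` on a made-up Series, then applies it to the `std` and `count` values reported by `.describe()` earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical counts, used only to compare the formula with .sem().
counts = pd.Series([1, 1, 2, 3, 1, 150])
manual_sem = counts.std() / np.sqrt(counts.count())

# Same formula with the std (4.864545) and count (1399) from the .describe() output.
from_describe = 4.864545 / np.sqrt(1399)

print(np.isclose(manual_sem, counts.sem()))
print(round(float(from_describe), 2))
```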

## Task 3: NumPy

Let's compare some of what we did earlier using the NumPy library instead of Pandas. 

Convert the `Observation Count` column to a NumPy array and store it in a new variable called `Observation_Count` using `.to_numpy()`.

In [None]:
Observation_Count = Birds['Observation Count'].to_numpy()

You can access basic descriptives on an array much like the DataFrame column.

Use `np.mean` to return the mean of `Observation Count`.

In [None]:
np.mean(Observation_Count)

2.4481772694781987

Use `np.min` to return the minimum of `Observation Count`.

In [None]:
np.min(Observation_Count)

1

Use `np.max` to return the maximum of `Observation Count`.

In [None]:
np.max(Observation_Count)

150
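
One practical difference appears when a column has missing values, as `Effort Distance KM` does in the fifth row of the `.head()` output. The sketch below uses those five values as a stand-in Series: the Pandas `.mean()` method skips `NaN` by default, while `np.mean` returns `nan` unless `np.nanmean` is used.

```python
import numpy as np
import pandas as pd

# The five 'Effort Distance KM' values shown in the .head() output, including the missing one.
effort = pd.Series([1.305, 1.609, 9.74, 0.306, np.nan], name="Effort Distance KM")

pandas_mean = effort.mean()                      # skips NaN by default
numpy_mean = np.mean(effort.to_numpy())          # NaN propagates through the calculation
numpy_nanmean = np.nanmean(effort.to_numpy())    # NaN-aware NumPy version

print(pandas_mean, numpy_mean, numpy_nanmean)
```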

## Task 4: Thought Question

> *In programming, learning that there are multiple ways to do the same thing is often confusing. In this activity, you discovered that you could return descriptive statistics using the Pandas `.describe()` method and the NumPy `np.mean`, `np.min`, and `np.max` functions.<br><br>As you think about using NumPy vs. Pandas for obtaining a mean, is there a distinct advantage to using Pandas when you have multiple columns in a DataFrame? What is the difference between how NumPy works compared to how Pandas works on a DataFrame?*







So far, Pandas seems to have a clear advantage over NumPy when working across multiple columns. In this activity, before making any calculations with NumPy, we first converted each column to an array with `df['column'].to_numpy()`. Pandas, on the other hand, lets us work with the DataFrame and its columns as soon as the data is read in from the source file. This is just my initial opinion; I'm sure that further down the line we will learn better ways to work with NumPy that make it advantageous over Pandas in some situations.
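
To make the comparison concrete, here is a minimal sketch on a small made-up DataFrame (the values are illustrative, not the full eBird data). Pandas keeps the column labels and can skip non-numeric columns, while the NumPy version works on an unlabeled array and needs the axis spelled out.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row DataFrame with two numeric columns and one text column.
demo = pd.DataFrame({
    "Common Name": ["American Crow", "American Crow", "Blue Jay", "Blue Jay"],
    "Observation Count": [1, 1, 2, 3],
    "Duration Minutes": [84.0, 60.0, 6.0, 35.0],
})

# Pandas: one call returns a labeled mean for every numeric column.
pandas_means = demo.mean(numeric_only=True)

# NumPy: select the numeric columns, convert to an array, then average down the rows (axis=0).
numeric_array = demo.select_dtypes("number").to_numpy()
numpy_means = np.mean(numeric_array, axis=0)

print(pandas_means)
print(numpy_means)
```

Both approaches give the same numbers; the Pandas result is a Series indexed by column name, whereas the NumPy result is a plain array whose positions have to be matched back to columns by hand.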