# Playing with Pandas and NumPy


---

*Key questions:*

  - "How can I import data in Python?"
  - "What is Pandas?"
  - "Why should I use Pandas to work with data?"

In [None]:
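import urllib.request
# Download the raw CSV file. A github.com/.../blob/... address returns an HTML page, not the data,
# so this uses the raw-file form of the repository link (assumed equivalent of the original URL).
url = 'https://raw.githubusercontent.com/aaneloy/DATA2010-Fall2021-Lab/main/survey.csv'
# Save the file under the same name that `pd.read_csv` reads below
urllib.request.urlretrieve(url, 'survey.csv')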

In [None]:
!pip install pandas matplotlib

In [None]:
import pandas as pd
import numpy as np

In [None]:
surveys_df = pd.read_csv("survey.csv")

Notice when you assign the imported DataFrame to a variable, Python does not
produce any output on the screen. We can view the value of the `surveys_df`
object by typing its name into the cell.

In [None]:
surveys_df

A full DataFrame can be too long to fit in one window, so it is often easier to display just a few rows. Notice that pandas formats the table neatly to fit the screen.

Here, we will use a method called **head**.

The `head()` method displays the first rows of a DataFrame (five by default). We come back to it in the challenge below.


In [None]:
surveys_df.head()

## Exploring Our Species Survey Data

We can use the `type` function to see what kind of thing `surveys_df` is:



In [None]:
type(surveys_df)


As expected, it's a DataFrame (or, to use the full name that Python uses to refer
to it internally, a `pandas.core.frame.DataFrame`).

What kind of things does `surveys_df` contain? DataFrames have an attribute
called `dtypes` that answers this:



In [None]:
surveys_df.dtypes
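
As a rough illustration of what `dtypes` reports, the sketch below builds a tiny made-up table instead of reading `survey.csv`. The column names mirror ones used in this lesson, but the values are invented for the example.

```python
import pandas as pd

# Hypothetical miniature of the survey table; the values are made up.
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "DM", "DM"],
    "weight": [40.0, None, 29.0],
})

# One dtype per column: whole numbers, text, and floats.
# A column holding a missing value (NaN) is stored as float.
print(toy_df.dtypes)
```

The result is a Series indexed by column name, so `toy_df.dtypes["weight"]` looks up the type of a single column.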

## Challenge - DataFrames

Using our DataFrame `surveys_df`, try out the attributes & methods below to see
what they return.

1. `surveys_df.columns`
2. `surveys_df.shape` Take note of the output of `shape` - what format does it
   return the shape of the DataFrame in?   HINT: [More on tuples, here](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences).
3. `surveys_df.head()` Also, what does `surveys_df.head(15)` do?
4. `surveys_df.tail()`




## Solution - DataFrames

... try it yourself !

# Calculating Statistics From Data

We've read our data into Python. Next, let's compute some quick summary
statistics to learn more about the data that we're working with. We might want
to know how many animals were collected at each site, or how many of each
species were caught. We can compute summary statistics quickly using groups. But
first we need to figure out what we want to group by.

Let's begin by exploring our data:



In [None]:
# Look at the column names
surveys_df.columns

Let's get a list of all the species. The `pd.unique` function tells us all of
the unique values in the `species_id` column.

In [None]:
pd.unique(surveys_df['species_id'])

# Interpreting missing data

1. Check whether there is any missing data.
2. Copy the DataFrame so that the original stays available for later use.
3. Replace the missing data in the copy with zero (`0`).

In [None]:
# Check for missing values

surveys_df.isna().sum()   # or surveys_df.isnull().sum() for older pandas versions

In [None]:
# Copy the DataFrame so the original keeps its missing values

surveys_df_2 = surveys_df.copy()

In [None]:
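# Replace missing values with zero by assigning the result back to each column
# (calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas)
surveys_df_2['weight'] = surveys_df_2['weight'].fillna(0)
surveys_df_2['hindfoot_length'] = surveys_df_2['hindfoot_length'].fillna(0)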
surveys_df_2

In [None]:
# We can also simply drop the 'NaN' or 'NULL' value rows
surveys_df_3 = surveys_df.dropna()
surveys_df_3

In [None]:
# Check whether any null values remain
surveys_df_3.isna().sum()
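
The choice between filling and dropping matters for later statistics. The sketch below uses three made-up weights (not the survey data) to show how the mean changes when missing values are replaced by zero rather than skipped.

```python
import numpy as np
import pandas as pd

weights = pd.Series([10.0, 20.0, np.nan], name="weight")  # made-up values

mean_skipping_nan = weights.mean()          # NaN is skipped by default
mean_after_fill = weights.fillna(0).mean()  # the zero is averaged in
mean_after_drop = weights.dropna().mean()   # same rows as the default behaviour

print(mean_skipping_nan, mean_after_fill, mean_after_drop)
```

Filling with zero pulls the mean down because the zero is treated as a real measurement, so it is worth deciding which behaviour matches the question being asked.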

## Challenge - Statistics

1. Create a list of unique site IDs found in the surveys data. Call it
  `site_names`. How many unique sites are there in the data? How many unique
  species are in the data?

2. What is the difference between `len(site_names)` and `surveys_df['site_id'].nunique()`?

## Solution - Statistics

In [None]:
site_names = pd.unique(surveys_df['site_id'])
print(len(site_names), surveys_df['site_id'].nunique())
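
The two counts agree whenever the column has no missing values. Where they can differ is in the handling of `NaN`, which the following sketch shows on a made-up Series (not the survey data).

```python
import numpy as np
import pandas as pd

site_ids = pd.Series([1.0, 2.0, 2.0, np.nan])  # made-up values with one missing entry

unique_values = pd.unique(site_ids)    # array of distinct values, NaN included
n_with_nan = len(unique_values)
n_default = site_ids.nunique()         # NaN is excluded by default
n_keep_nan = site_ids.nunique(dropna=False)

print(n_with_nan, n_default, n_keep_nan)
```

`pd.unique` returns the values themselves, while `nunique()` returns only a count.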

# Groups in Pandas

We often want to calculate summary statistics grouped by subsets or attributes
within fields of our data. For example, we might want to calculate the average
weight of all individuals per site.

We can calculate basic statistics for all records in a single column using the
syntax below:

In [None]:
surveys_df['weight'].describe()


We can also extract one specific metric if we wish:



In [None]:
surveys_df['weight'].min()
surveys_df['weight'].max()
surveys_df['weight'].mean()
surveys_df['weight'].std()
# only the last command shows output below - you can try the others above in new cells
surveys_df['weight'].count()


But if we want to summarize by one or more variables, for example sex, we can
use the **pandas `.groupby` method**. Once we've created a grouped object (a
`DataFrameGroupBy`), we can quickly calculate summary statistics for each group.



In [None]:
# Group data by sex
grouped_data = surveys_df.groupby('sex')


The **pandas method `describe`** returns descriptive statistics for each column:
count, mean, std, min, max and the quartiles (the 50% value is the median).
**Note:** by default, `describe` only returns summary values for columns
containing numeric data.



In [None]:
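# Summary statistics for all numeric columns by sex
grouped_data.describe()

# Provide the mean for each numeric column by sex
# numeric_only=True skips text columns such as species_id, which recent pandas versions refuse to average
# As above, only the last command shows output below - you can try the others above in new cells
grouped_data.mean(numeric_only=True)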


The `groupby` command is powerful in that it allows us to quickly generate
summary stats.
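
To make the mechanics concrete, here is a minimal sketch on a made-up three-row table. The `sex` and `weight` column names follow the lesson, and the values are invented.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "sex": ["F", "F", "M"],
    "weight": [10.0, 20.0, 30.0],  # made-up values
})

# Split the rows by sex, then average the weight within each group.
mean_weight_by_sex = toy_df.groupby("sex")["weight"].mean()
print(mean_weight_by_sex)
```

Selecting `['weight']` before calling `mean()` restricts the calculation to one column, which also avoids trying to average text columns.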



## Challenge - Summary Data

1. How many recorded individuals are female `F` and how many are male `M`?
    - A) 17348 and 15690
    - B) 14894 and 16476
    - C) 15303 and 16879
    - D) 15690 and 17348


2. What happens when you group by two columns using the following syntax and
    then grab mean values:
	- `grouped_data2 = surveys_df.groupby(['site_id','sex'])`
	- `grouped_data2.mean(numeric_only=True)`


3. Summarize weight values for each site in your data. HINT: you can use the
  following syntax to only create summary statistics for one column in your data
  `by_site['weight'].describe()`



## Solution - Summary Data

In [None]:
## Solution Challenge 1
grouped_data.count()

### Solution - Challenge 2

The mean value for each combination of site and sex is calculated. Note that the
mean is not meaningful for every variable, so you can choose the aggregation column by column
by passing a dictionary to `.agg`, e.g. to get the last survey year, the median hindfoot length and the mean weight for each site/sex combination.
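
One way to sketch such a column-wise summary is to pass a dictionary to `.agg`, mapping each column to its own aggregation. The table below is made up, and the `year` column name is an assumption about the survey file rather than something shown earlier in this notebook.

```python
import pandas as pd

# Made-up rows; `year` is an assumed column name.
toy_df = pd.DataFrame({
    "site_id": [1, 1, 1, 2],
    "sex": ["F", "F", "M", "F"],
    "year": [1977, 1980, 1979, 1982],
    "hindfoot_length": [30.0, 34.0, 36.0, 20.0],
    "weight": [40.0, 50.0, 45.0, 10.0],
})

# Last survey year, median hindfoot length and mean weight per site/sex pair.
summary = toy_df.groupby(["site_id", "sex"]).agg(
    {"year": "max", "hindfoot_length": "median", "weight": "mean"}
)
print(summary)
```

The result is indexed by the `(site_id, sex)` pairs, with one column per aggregated variable.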

In [None]:
# Solution- Challenge 3
surveys_df.groupby(['site_id'])['weight'].describe()

## Did you get #3 right?
 **A Snippet of the Output from part 3 of the challenge looks like:**

```
	site_id
	1     count    1903.000000
	      mean       51.822911
	      std        38.176670
	      min         4.000000
	      25%        30.000000
	      50%        44.000000
	      75%        53.000000
	      max       231.000000
         ...
```



## Quickly Creating Summary Counts in Pandas

Let's next count the number of samples for each species. We can do this in a few
ways, but we'll use `groupby` combined with **a `count()` method**.




In [None]:
# Count the number of samples by species
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)


Or, we can also count just the rows that have the species "DO":



In [None]:
surveys_df.groupby('species_id')['record_id'].count()['DO']
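
Another of the "few ways" is `value_counts()`, which counts the occurrences of each value in a single column. The sketch below compares both approaches on made-up rows. They agree here because no `record_id` is missing.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["DO", "DM", "DO", "DM"],  # made-up rows
})

by_groupby = toy_df.groupby("species_id")["record_id"].count()
by_value_counts = toy_df["species_id"].value_counts()

print(by_groupby["DO"], by_value_counts["DO"])
```

`value_counts()` sorts its result by frequency, while `groupby(...).count()` sorts by the group key.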

## Basic Math Functions

If we wanted to, we could perform math on an entire column of our data. For
example let's multiply all weight values by 2. A more practical use of this might
be to normalize the data according to a mean, area, or some other value
calculated from our data.



In [None]:
# Multiply all weight values by 2. The original `weight` column is left unchanged; the result is stored in a new column, `weighted_value`
surveys_df['weighted_value'] = surveys_df['weight']*2

In [None]:
surveys_df
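
The normalization mentioned above follows the same pattern. As a minimal sketch on made-up weights, dividing a column by its own mean expresses every value relative to the average. The `weight_relative` column name is hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({"weight": [10.0, 20.0, 30.0]})  # made-up values

# Column arithmetic is applied element by element;
# the mean (20.0 here) is a single number that every row is divided by.
toy_df["weight_relative"] = toy_df["weight"] / toy_df["weight"].mean()
print(toy_df)
```

A value of 1.0 then means "exactly average", and 1.5 means 50% heavier than average.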

## Quick & Easy Plotting Data Using Pandas

We can plot our summary stats using Pandas, too.



In [None]:
import matplotlib.pyplot as plt

## To make sure figures appear inside Jupyter Notebook
%matplotlib inline

# Create a quick bar chart
plt.figure(figsize=(12,6))
species_counts.plot(kind='bar')

#### Animals per site plot

We can also look at how many animals were captured in each site.

In [None]:
total_count = surveys_df.groupby('site_id')['record_id'].nunique()
# Let's plot that too
total_count.plot(kind='bar')

## _Extra Plotting Challenge_

1. Create a plot of average weight across all species per site.

2. Create a plot of total males versus total females for the entire dataset.
 
3. Create a stacked bar plot, with weight on the Y axis, and the stacked variable being sex. The plot should show total weight by sex for each site. Some tips are below to help you solve this challenge:
[For more on Pandas plots, visit this link.](http://pandas.pydata.org/pandas-docs/stable/visualization.html#basic-plotting-plot)





### _Solution to Extra Plotting Challenge 1_

![The output should look like this](Plot1.png "The output should look like this")
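
One possible approach, sketched here on made-up rows rather than the survey data, is to group by site, average the weight, and plot the resulting Series. The `Agg` backend line is only there so the sketch runs outside a notebook; leave it out when working in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside Jupyter
import pandas as pd

toy_df = pd.DataFrame({
    "site_id": [1, 1, 2, 2],
    "weight": [10.0, 30.0, 40.0, 60.0],  # made-up values
})

# Average weight per site, drawn as one bar per site.
mean_weight_by_site = toy_df.groupby("site_id")["weight"].mean()
ax = mean_weight_by_site.plot(kind="bar")
ax.set_ylabel("mean weight")
```

With the real data, the same two lines would start from `surveys_df` instead of `toy_df`.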

### _Solution to Extra Plotting Challenge 2_

Create a plot of total males versus total females for the entire dataset.

![The output should look like this](Plot2.png "The output should look like this")
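
A minimal sketch of one way to get such a plot, again on made-up rows: count the records in each `sex` group and draw the counts as bars. The `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside Jupyter
import pandas as pd

toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "sex": ["F", "M", "M"],  # made-up rows
})

# Number of records per sex, drawn as one bar per group.
counts_by_sex = toy_df.groupby("sex")["record_id"].count()
ax = counts_by_sex.plot(kind="bar")
ax.set_ylabel("number of records")
```

Rows with a missing `sex` value are left out of the groups, so they do not appear in the plot.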

### _Solution to Extra Plotting Challenge 3_

First we group the data by site and by sex, and then calculate the total weight for each site/sex combination.

![The output should look like this](Plot3.png "The output should look like this")


This calculates the sums of weights for each sex within each site as a table:

```
site_id  sex
1        F      38253
         M      59979
2        F      50144
         M      57250
3        F      27251
         M      28253
4        F      39796
         M      49377
<other sites removed for brevity>
```
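
A table of this shape can be produced by grouping on both columns and summing the weights. The sketch below uses made-up rows, so its totals do not match the table above.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "site_id": [1, 1, 1, 2, 2],
    "sex": ["F", "M", "M", "F", "M"],
    "weight": [10.0, 20.0, 30.0, 5.0, 15.0],  # made-up values
})

# Total weight for every (site_id, sex) pair; the result has a two-level index.
site_sex_weight = toy_df.groupby(["site_id", "sex"])["weight"].sum()
print(site_sex_weight)
```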

We can use `.unstack()` on our grouped data to turn the `sex` level of the index into columns, which shows the total weight that each sex contributed to each site.

Rather than displaying the result as a table, we can then create a stacked bar plot in which the weights for each sex are stacked by site.
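
The two steps can be sketched as follows, on made-up rows rather than the survey data. The variable names and the plot title are illustrative, and the `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside Jupyter
import pandas as pd

toy_df = pd.DataFrame({
    "site_id": [1, 1, 1, 2, 2],
    "sex": ["F", "M", "M", "F", "M"],
    "weight": [10.0, 20.0, 30.0, 5.0, 15.0],  # made-up values
})

by_site_sex = toy_df.groupby(["site_id", "sex"])["weight"].sum()

# Move the `sex` index level into columns: one row per site, one column per sex.
site_by_sex = by_site_sex.unstack()

# Each site becomes one bar, with the F and M totals stacked on top of each other.
ax = site_by_sex.plot(kind="bar", stacked=True, title="Total weight by site and sex")
ax.set_xlabel("site_id")
ax.set_ylabel("weight")
```

Because `unstack()` yields one column per sex, `plot(kind="bar", stacked=True)` can colour each column separately within the same bar.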