# Introduction to Python for Biology
# Day 5

## Pandas Library

* A data analysis library — **Pan**el **Da**ta **S**ystem.
* Created by Wes McKinney in 2009.
* Implemented in highly optimized Python/Cython.
* Like Excel or R for Python!

### Pandas is used for

* Cleaning data/munging.
* Exploratory analysis.
* Structuring data for plots or tabular display.
* Joining disparate sources.
* Modeling.
* Filtering, extracting, or transforming.

### Importing Pandas

Import Pandas at the top of your notebook. Give it the nickname **pd** so you don't have to keep typing "pandas." (You can nickname it anything or leave out the nickname, but `pd` is the convention.)

### Loading a CSV as a DataFrame

Pandas can load many types of files, but one of the most common types is .csv (comma separated values).
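Here is a minimal sketch of the import and load steps. To keep it self-contained it reads from an in-memory string instead of a file on disk; the rows are invented for illustration, and the column names are borrowed from the Titanic data used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (values invented for illustration).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# pd.read_csv accepts a file path or any file-like object.
# With a real file you would write something like pd.read_csv("titanic.csv").
titanic = pd.read_csv(io.StringIO(csv_text))

print(type(titanic))
print(titanic)
```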

This creates a Pandas object called a **DataFrame.**  

DataFrames are powerful containers that have lots of built-in functions for exploring and manipulating your data. 

### Exploring the data using DataFrames

#### Use .head() to examine the top of the DataFrame

#### Use .tail() to examine the bottom

#### The .shape property will tell you how many rows and columns you have

#### You can look up the names of your columns using the .columns property.
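The four inspection tools introduced so far (`.head()`, `.tail()`, `.shape`, and `.columns`) can be tried on a small toy frame. The values below are invented; only the column names come from the Titanic data.

```python
import pandas as pd

# Toy data (invented values) using a few Titanic-style column names.
titanic = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7],
    "Survived": [0, 1, 1, 1, 0, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1],
})

print(titanic.head())    # first 5 rows by default
print(titanic.tail(2))   # last 2 rows
print(titanic.shape)     # (rows, columns); a property, so no parentheses
print(titanic.columns)   # Index object holding the column names
```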

#### You can access a specific column with bracket syntax (like with dictionaries) using the column's string name.

#### You can also access it using dot notation. (When might this not work?)
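A short sketch comparing the two access styles. The data is invented, and `"Passenger Class"` is a hypothetical column name chosen only because it contains a space.

```python
import pandas as pd

# Toy data (invented); "Passenger Class" is a hypothetical column name.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Passenger Class": [3, 1, 3],
})

ages_bracket = df["Age"]   # bracket syntax always works
ages_dot = df.Age          # dot notation works for simple names

print(type(ages_bracket))
print(ages_bracket.equals(ages_dot))

# Dot notation cannot express a name containing a space, and it also fails
# when the name collides with an existing DataFrame attribute or method.
classes = df["Passenger Class"]
print(classes)
```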

Notice that this looks a little different from our DataFrame above. That is because it is a Series object, which is a little different from a DataFrame.

**What's the difference between Pandas' Series and DataFrame objects?**  
Essentially, a Series object contains the data for a single column, and a DataFrame object is a matrix-like container for those Series objects that comprise your data. They mostly act like one another, but occasionally you'll run into methods that only work for one.

#### Examining Your Data With .info()  
Provides information about:

* The name of the column/variable attribute.
* The type of index (RangeIndex is default).
* The count of non-null values by column/attribute.
* The type of data contained in the column/attribute.
* The number of columns of each dtype (pandas data type).
* The memory usage of our data set.

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
```


Types affect the way data is represented in machine learning models, whether we can apply math operators to them, etc.   

Some common problems with working with a new dataset:  
* Missing values.
* Unexpected types (string/object instead of int/float).
* Dirty data (commas, dollar signs, unexpected characters, etc.).
* Blank values that are actually "non-null" or single white-space characters.
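One way to check for each of these problems is sketched below. The frame and its column names are invented for illustration, and the cleanup step is just one possible approach.

```python
import pandas as pd

# Toy data (invented) showing each problem from the list above.
df = pd.DataFrame({
    "price": ["$1,200", "950", "$80"],   # dirty numeric data stored as strings
    "age": [34.0, None, 51.0],           # a genuinely missing value
    "cabin": ["C85", " ", "E46"],        # a blank that is not null
})

print(df.dtypes)           # "price" is not a numeric dtype
print(df.isnull().sum())   # counts missing values per column

# Whitespace-only strings count as non-null, so look for them explicitly.
blank_cabins = df["cabin"].str.strip() == ""
print(blank_cabins.sum())

# Strip the unwanted characters, then convert to a numeric dtype.
df["price"] = df["price"].str.replace("[$,]", "", regex=True).astype(float)
print(df["price"])
```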

#### Summarize the data with .describe()
It gives us the following statistics:

* Count, which is the number of non-null values in the column.
* Mean, or, the average of the values in the column.
* Std, which is the standard deviation.
* Min, a.k.a., the minimum value.
* 25%, or, the 25th percentile of the values.
* 50%, or, the 50th percentile of the values (which is equivalent to the median).
* 75%, or, the 75th percentile of the values.
* Max, which is the maximum value.  

Let's try this on a single column as well as the entire dataframe.

There are also built-in math functions that will work on all columns of a DataFrame at once, as well as subsets of the data.

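#### For example, I can use the .mean() function on the titanic DataFrame to get the mean for every numeric column.

In recent pandas versions (2.0 and later), pass `numeric_only=True` so that text columns are skipped; without it, a DataFrame containing string columns raises an error.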
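A small sketch of `.describe()` and `.mean()` on a single column and on a whole frame. The values are invented; the column names follow the Titanic data.

```python
import pandas as pd

# Toy data (invented) with Titanic-style column names.
titanic = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.25, 8.0, 53.5],
})

print(titanic["Age"].describe())   # one column: returns a Series
print(titanic.describe())          # whole frame: numeric columns only by default

# numeric_only=True skips text columns such as "Sex".
means = titanic.mean(numeric_only=True)
print(means)
```

Note that the missing `Age` value is left out of both the count and the mean.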

## Filtering and Sorting DataFrames

#### Filter drinks to include only European countries.

First we create a Series of Booleans.

Then we can use this Series to filter our DataFrame. (This is why we see `drinks` twice.)
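The two steps look roughly like this. The frame is a toy stand-in for the drinks data: the rows are invented, and `"EU"` and `"AS"` are assumed continent codes.

```python
import pandas as pd

# Toy stand-in for the drinks data (invented values, assumed continent codes).
drinks = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "continent": ["EU", "AS", "EU", "EU"],
    "beer_servings": [250, 80, 340, 120],
    "wine_servings": [310, 10, 150, 320],
})

is_europe = drinks["continent"] == "EU"   # Series of True/False, one per row
print(is_europe)

europe = drinks[is_europe]                # keep only the rows marked True
print(europe)

# The same filter written in one line, which is why `drinks` appears twice.
europe_one_line = drinks[drinks["continent"] == "EU"]
```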

#### Filter drinks to include only European countries with wine_servings > 300.

#### Filter drinks to include only countries with wine_servings > 300 or beer_servings > 300.

#### Calculate the mean beer_servings for all of Europe.
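The three exercises above (combining conditions with "and", combining them with "or", and averaging a filtered column) can be sketched as follows. The data is again an invented stand-in with assumed continent codes.

```python
import pandas as pd

# Toy stand-in for the drinks data (invented values, assumed continent codes).
drinks = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "continent": ["EU", "AS", "EU", "EU"],
    "beer_servings": [250, 80, 340, 120],
    "wine_servings": [310, 10, 150, 320],
})

# Use & for "and" and | for "or"; each condition needs its own parentheses.
eu_wine = drinks[(drinks["continent"] == "EU") & (drinks["wine_servings"] > 300)]
heavy = drinks[(drinks["wine_servings"] > 300) | (drinks["beer_servings"] > 300)]

# .loc takes a row filter and a column name, then .mean() averages the result.
eu_beer_mean = drinks.loc[drinks["continent"] == "EU", "beer_servings"].mean()

print(eu_wine)
print(heavy)
print(eu_beer_mean)
```

Python's `and` and `or` keywords do not work element by element on a Series, which is why the `&` and `|` operators are used here.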

#### Determine which 10 countries have the highest total_litres_of_pure_alcohol.

#### Which 10 countries have the lowest total_litres_of_pure_alcohol?

Side note: This does not change the underlying data. How can we change the underlying data?

#### Let's sort by multiple columns. First sort by `beer_servings` then by `wine_servings`.
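The sorting steps in this section can be sketched with `sort_values()`. The data is an invented stand-in for the drinks file, and `.head(2)` is used only because the toy frame is so small.

```python
import pandas as pd

# Toy stand-in for the drinks data (invented values).
drinks = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "beer_servings": [250, 80, 250, 120],
    "wine_servings": [310, 10, 150, 320],
    "total_litres_of_pure_alcohol": [10.4, 1.2, 11.8, 6.5],
})

# Highest values first; use .head(10) on the real data for a top 10.
top = drinks.sort_values("total_litres_of_pure_alcohol", ascending=False).head(2)
# Lowest values first (ascending order is the default).
bottom = drinks.sort_values("total_litres_of_pure_alcohol").head(2)

# sort_values returns a new DataFrame, so `drinks` keeps its original order.
# Assign the result to a name to keep the sorted version.
drinks_sorted = drinks.sort_values("total_litres_of_pure_alcohol")

# Sort by beer_servings first, breaking ties with wine_servings.
by_beer_then_wine = drinks.sort_values(["beer_servings", "wine_servings"])
print(by_beer_then_wine)
```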

## Data Visualization in Python

Data visualization is used to explore your data and to communicate your data (early and late in the workflow). My goal for you isn't to make you a data viz wizard, but to give you enough of an understanding to be able to explore your data and jumpstart your own learning.  

The **matplotlib** library is great for making simple plots. If you want to make elaborate interactive visualizations, it's probably not for you. But it's good for quick and dirty data viz that is relatively customizable. 

The **seaborn** library is built on matplotlib and has a lot more "out-of-the-box" functionality for quick plots.

```python
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
```

### Choosing the correct plot
We'll match plot type with model 
<img src="assets/chartpicker.jpg"/>

## Basic Matplotlib `plt.plot(xxx)` Plotting Formula

* `plt.plot(x, y)` will create your plot. `.plot()` may be replaced with a different type of plot, like `.bar()`.
* `plt.title("My Title")` will add the title "My Title" to your plot.
* `plt.xlabel("Year")` will add the label "Year" to your x-axis.
* `plt.ylabel("Population")` will add the label "Population" to your y-axis.
* `plt.xticks([1, 2, 3, 4, 5])` will set the tick positions on the x-axis to 1, 2, 3, 4, 5. We can also pass labels as a second argument. For example, `plt.xticks([1, 2, 3, 4, 5], ["1M", "2M", "3M", "4M", "5M"])` will set the labels 1M, 2M, 3M, 4M, 5M on the x-axis.
* `plt.yticks()` works the same as `plt.xticks()`, but for the y-axis.

### Line Plot
* displays information as a series of data points called “markers” connected by straight lines 
* need the measurement points to be ordered (typically by their x-axis values) 
* often used to visualize a trend in data over intervals of time (time series)


To make a line plot with Matplotlib, we call `plt.plot()`. 

The first argument is used for the data on the horizontal axis, and the second is used for the data on the vertical axis. 

This function generates your plot, but it doesn’t display it. To display the plot, we need to call the `plt.show()` function. This is nice because we might want to add some additional customizations to our plot before we display it. For example, we might want to add labels to the axes and a title to the plot.

```python
years = [1974, 1975, 1976, 1977, 1978]
total_populations = [8939007, 8954518, 8960387, 8956741, 8943721]
```
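Putting the plotting formula together with the `years` and `total_populations` lists gives something like the sketch below. The lists are repeated so the block runs on its own, and the title text is a made-up example.

```python
import matplotlib.pyplot as plt

years = [1974, 1975, 1976, 1977, 1978]
total_populations = [8939007, 8954518, 8960387, 8956741, 8943721]

plt.figure()
lines = plt.plot(years, total_populations)   # x values first, then y values
plt.title("Population by Year")              # hypothetical title
plt.xlabel("Year")
plt.ylabel("Population")
plt.xticks(years)                            # one tick per year
plt.show()                                   # customize first, display last
```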

### Scatter Plot
* shows all individual data points but doesn't connect them with lines
* used to display trends or correlations and how 2 variables compare

To make a scatter plot with Matplotlib, we can use the `plt.scatter()` function. Again, the first argument is used for the data on the horizontal axis, and the second for the vertical axis.

#### Load in the iris dataset

We can save the figure with `.savefig()`
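The iris-loading cell is not shown here, so this sketch uses a handful of invented measurements in place of two iris columns. The axis labels and the output filename are also assumptions made for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Invented measurements standing in for two iris columns.
sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]
petal_length = [1.4, 1.5, 4.9, 5.1, 5.9]

plt.figure()
points = plt.scatter(sepal_length, petal_length)   # x values first, then y values
plt.xlabel("Sepal length")
plt.ylabel("Petal length")

# Save before calling plt.show(); the file type follows the extension.
# The temp directory and filename are hypothetical choices for this sketch.
out_path = os.path.join(tempfile.gettempdir(), "iris_scatter.png")
plt.savefig(out_path)
```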

### Histogram
* represents the distribution of numeric data
* divides the range of values into a series of intervals (AKA bins) and counts how many values fall in each interval

To make a histogram with Matplotlib, we can use the `plt.hist()` function. The first argument is the numeric data, the second argument is the number of bins. The default value for the bins argument is 10.

```python
numbers = [0.1, 0.5, 1, 1.5, 2, 4, 5.5, 6, 8, 9]
```
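A sketch of a histogram of `numbers` follows. The choice of 3 bins and the axis labels are arbitrary assumptions, and the list is repeated so the block runs on its own.

```python
import matplotlib.pyplot as plt

numbers = [0.1, 0.5, 1, 1.5, 2, 4, 5.5, 6, 8, 9]

plt.figure()
# plt.hist returns the count in each bin, the bin edges, and the drawn bars.
counts, bin_edges, bars = plt.hist(numbers, bins=3)
plt.xlabel("Value")
plt.ylabel("Count")

print(counts)
print(bin_edges)
```

Printing `counts` and `bin_edges` is a handy way to confirm that the bars match what you expect from the raw values.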

### Box Plot
* (AKA box-and-whisker plot) a way to show the distribution of values based on the five-number summary: minimum, first quartile, median, third quartile, and maximum
* minimum and maximum are just the min and max values from our data.
* median is the value that separates the higher half of the data from the lower half
* first quartile is the median of the data values to the left of the median in our ordered values.
* third quartile is the median of the data values to the right of the median in our ordered values. (third quartile minus first quartile is referred to as the interquartile range, or IQR)
* outlier is a data value that lies outside the overall pattern. There are many ways to identify what is an outlier. A commonly used rule says that a value is an outlier if it’s less than the first quartile - 1.5 * IQR or higher than the third quartile + 1.5 * IQR. 

<img src="assets/boxplot.png"/>

To create this plot with Matplotlib we use `plt.boxplot()`. The first argument is the data points.

```python
values = [1, 2, 5, 6, 6, 7, 7, 8, 8, 8, 9, 10, 21]
```
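The sketch below applies the 1.5 * IQR rule to `values` by hand and then draws the box plot. One caveat: NumPy (which Matplotlib relies on here) computes quartiles by interpolating between data points, so its quartiles can differ slightly from the "median of each half" definition given above, and the flagged outliers can differ too.

```python
import matplotlib.pyplot as plt
import numpy as np

values = [1, 2, 5, 6, 6, 7, 7, 8, 8, 8, 9, 10, 21]

# Quartiles as NumPy computes them (linear interpolation by default).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(q1, median, q3, outliers)

plt.figure()
plt.boxplot(values)   # whiskers default to 1.5 * IQR, matching the rule above
```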


### Bar Chart
* represents categorical data with rectangular bars 
* each bar has a height that corresponds to the value it represents
* used when we want to compare a given numeric value across different categories


To make a bar chart with Matplotlib, we’ll need the `plt.bar()` function.

```python
languages = ['Python', 'SQL', 'Java', 'C++', 'JavaScript']
# generating the y positions
pos = np.arange(len(languages))
popularity = [56, 39, 34, 34, 29]
```
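With the positions, labels, and heights defined, the bar chart itself might be drawn as in this sketch. The data is repeated so the block runs on its own, and the axis label is a made-up example.

```python
import matplotlib.pyplot as plt
import numpy as np

languages = ['Python', 'SQL', 'Java', 'C++', 'JavaScript']
pos = np.arange(len(languages))    # bar positions 0 through 4
popularity = [56, 39, 34, 34, 29]

plt.figure()
bars = plt.bar(pos, popularity)    # positions first, then bar heights
plt.xticks(pos, languages)         # label each position with its language
plt.ylabel("Popularity")           # hypothetical axis label
```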



## Pandas Plotting
You can also plot directly from a DataFrame because Pandas is tightly integrated with Matplotlib. This is great for when you just need a quick and easy plot of your data.

#### Customizing: Change color and size 

#### Creating a Scatter Matrix with Pandas
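Both ideas, customizing a DataFrame plot and building a scatter matrix, are sketched below. The measurements are invented, the iris-style column names are assumptions, and the color and figure sizes are arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented measurements with assumed iris-style column names.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "sepal_width": [3.5, 3.0, 3.3, 2.7, 3.0, 3.2],
    "petal_length": [1.4, 1.5, 4.9, 5.1, 5.9, 5.1],
})

# DataFrame.plot wraps matplotlib; color and figsize customize the result.
ax = iris.plot(kind="scatter", x="sepal_length", y="petal_length",
               color="green", figsize=(6, 4))

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = pd.plotting.scatter_matrix(iris, figsize=(6, 6))
```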

## Seaborn
Provides a high-level interface for graphics. I find it super useful for Exploratory Data Analysis because it makes it easy to get to know your data quickly.

#### Seaborn VS matplotlib
* seaborn extends matplotlib, so they can do the same things
* matplotlib makes easy things easy and hard things possible
* seaborn tries to take some of those hard things and make them easier to do
* seaborn has better defaults (e.g. colors, tick marks, etc.)
* seaborn also makes it a little easier to work with DataFrames

#### Seaborn's well-defined set of hard things it makes easy
* using default themes that are aesthetically pleasing.
* setting custom color palettes.
* making attractive statistical plots.
* easily and flexibly displaying distributions.
* visualizing information from matrices and DataFrames.

### Scatterplots
* making a scatter plot takes just one line of code using `lmplot()` 
* pass your DataFrame to the `data=` argument, passing column names to the axes arguments `x=` and `y=`

#### Scatterplot parameters
We actually used Seaborn's function for fitting and plotting a regression line, `lmplot()`, which is why you see a diagonal line. (Older Seaborn releases had no dedicated scatter plot function; version 0.9 and later also provide `scatterplot()`.)

**Useful plotting options to set**
* Set `fit_reg=False` to remove the regression line, since we only want a scatter plot.
* Set `hue='species'` to color our points by the species. This hue argument is very useful because it allows you to express a third dimension of information using color.
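A sketch of `lmplot()` with both options set is shown below. The measurements are invented, the column names are assumed iris-style names, and the axis limits and title are arbitrary examples of a matplotlib tweak.

```python
import pandas as pd
import seaborn as sns

# Invented measurements with assumed iris-style column names.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "petal_length": [1.4, 1.5, 4.9, 5.1, 5.9, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "virginica", "virginica", "virginica"],
})

# fit_reg=False drops the regression line; hue colors points by species.
grid = sns.lmplot(data=iris, x="sepal_length", y="petal_length",
                  fit_reg=False, hue="species")

# lmplot returns a FacetGrid; its matplotlib Axes can be tweaked directly.
ax = grid.axes.flat[0]
ax.set_xlim(4, 8)
ax.set_title("Sepal vs. petal length")   # hypothetical title
```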

If we want to continue customizing this plot (like tweaking axes), then we can use matplotlib to do these customizations. 

Let's take a tour of some useful quick seaborn plots.

### Pairplot 
Shows relationships between all variables. I find this one super useful!

### Heatmap
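The original heatmap cell is not shown, so the sketch below makes an assumption: it plots a correlation matrix, which is a common use of `heatmap()`. The data is invented, and the color map and value limits are arbitrary choices.

```python
import pandas as pd
import seaborn as sns

# Invented measurements with assumed iris-style column names.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "sepal_width": [3.5, 3.0, 3.3, 2.7, 3.0, 3.2],
    "petal_length": [1.4, 1.5, 4.9, 5.1, 5.9, 5.1],
})

# Pairwise correlations between the numeric columns.
corr = iris.corr()

# annot=True writes each value in its cell; vmin/vmax pin the color scale.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```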

# Independent Practice

Load the `drinks.csv` data.  

Perform the following:  

1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the beer_servings column/Series to a variable.
4. Calculate summary statistics for beer_servings.
5. Calculate the mean of beer_servings.
6. Count the values of unique categories in continent. (.value_counts)
7. Print the dimensions of the drinks DataFrame.
8. Find the first three items of the value counts of the continent column.

### Bonus Problem: Pandas Practice: Filtering 

#### Using the UFO data ("ufo.csv")

1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most frequently reported colors.
4. Find the most frequent city for reports in state VA.
5. Find only UFO reports from Arlington, VA.

**Extra Bonus Items (check bonus materials for hints)**

6. Find the number of missing values in each column.
7. Show only UFO reports where city is missing.
8. Count the number of rows with no null values.
9. Amend column names with spaces to have underscores.
10. Make a new column that is a combination of city and state.
11. Drop rows where City or Shape Reported is missing.

### Data Viz: Exploring the Titanic Data
Explore the data in the `titanic.csv` file using your new knowledge of Seaborn and matplotlib. Be prepared to tell one interesting finding about this dataset (think about what you might want to explore further with modeling and statistics).

Stumped? Try looking at survival rates for gender and boarding class. 

### Bonus Problem: Exploring the Wine Data
Explore the data in the `wine.csv` file using your new knowledge of Seaborn and matplotlib. Be prepared to tell one interesting finding about this dataset (think about what you might want to explore further with modeling and statistics).

Stumped? Try asking how different variables relate to wine quality.