# Python Machine Learning In Biology:
# Introduction to Pandas

### What is Pandas?

* A data analysis library — **Pan**el **Da**ta **S**ystem.
* Created by Wes McKinney in 2009.
* Implemented in highly optimized Python/Cython.
* Like Excel or R for Python!

### Pandas is used for

* Cleaning data/munging.
* Exploratory analysis.
* Structuring data for plots or tabular display.
* Joining disparate sources.
* Modeling.
* Filtering, extracting, or transforming.

### Importing Pandas

Import Pandas at the top of your notebook. Give it the nickname **pd** so you don't have to keep typing "pandas." (You can nickname it anything, or leave out the nickname entirely.)
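A minimal version of the import cell, using the conventional `pd` alias. Printing the version is only a quick sanity check that the import worked:

```python
import pandas as pd  # "pd" is the conventional alias; any valid name works

# Confirm the library loaded by printing its version string.
print(pd.__version__)
```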

### Loading a CSV as a DataFrame

Pandas can load many types of files, but one of the most common types is .csv (comma separated values).
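A minimal sketch of loading a CSV with `pd.read_csv()`. The CSV contents and the name `passengers` are made up for illustration; in practice you would pass the path to your own file instead of an in-memory buffer:

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a real file on disk.
csv_text = "name,age,fare\nAnn,29,7.25\nBob,41,71.28\nCy,35,8.05\n"

# pd.read_csv accepts a file path or any file-like object.
passengers = pd.read_csv(io.StringIO(csv_text))
print(type(passengers))
```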

Reading a CSV with `pd.read_csv()` creates a Pandas object called a **DataFrame.**  

DataFrames are powerful containers that have lots of built-in functions for exploring and manipulating your data. 

### Exploring the data using DataFrames

#### Use .head() to examine the top of the DataFrame
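A short sketch on a hypothetical ten-row DataFrame (the column names are placeholders, not from any data set used here):

```python
import pandas as pd

# Hypothetical data: ten rows so head() has something to cut off.
df = pd.DataFrame({"sample": range(10), "value": [x * 1.5 for x in range(10)]})

print(df.head())   # first 5 rows by default
print(df.head(3))  # or pass the number of rows you want
```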

#### Use .tail() to examine the bottom
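The same idea from the other end, again on hypothetical placeholder data:

```python
import pandas as pd

# Hypothetical ten-row DataFrame.
df = pd.DataFrame({"sample": range(10), "value": [x * 1.5 for x in range(10)]})

print(df.tail())   # last 5 rows by default
print(df.tail(2))  # last 2 rows
```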

#### The .shape property will tell you how many rows and columns you have
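Note that `.shape` is a property, not a method, so it takes no parentheses. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical DataFrame with 10 rows and 2 columns.
df = pd.DataFrame({"sample": range(10), "value": [x * 1.5 for x in range(10)]})

n_rows, n_cols = df.shape  # a (rows, columns) tuple
print(n_rows, n_cols)
```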

#### You can look up the names of your columns using the .columns property.
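A sketch on hypothetical data. Wrapping the result in `list()` is optional and just gives a plain Python list of names:

```python
import pandas as pd

# Hypothetical DataFrame with placeholder column names.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [29, 41], "fare": [7.25, 71.28]})

print(df.columns)               # an Index object holding the column labels
column_names = list(df.columns)  # plain list of the same labels
print(column_names)
```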

#### You can access a specific column with bracket syntax (like with dictionaries) using the column's string name.

Notice how this looks a little different from the way the DataFrame was displayed above. That is because a single column comes back as a pandas Series object.  

With a list of strings, you can also access a column (as a DataFrame instead of a Series).

**DataFrame vs. Series** Putting a column name in single square brackets returns a pandas Series. Putting it in double square brackets (that is, passing a list of column names) returns a DataFrame.  
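The difference between the two bracket styles can be checked directly. This sketch uses hypothetical columns `age` and `fare`:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"age": [22, 38, 26], "fare": [7.25, 71.28, 7.93]})

age_series = df["age"]    # single brackets with a string -> Series
age_frame = df[["age"]]   # a list of names -> DataFrame (here with one column)

print(type(age_series).__name__)
print(type(age_frame).__name__)
```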

**What's the difference between Pandas' Series and DataFrame objects?**  
Essentially, a Series object contains the data for a single column, and a DataFrame object is a matrix-like container for those Series objects that comprise your data.

#### Examining Your Data With .info()  
Provides information about:

* The name of the column/variable attribute.
* The type of index (RangeIndex is default).
* The count of non-null values by column/attribute.
* The type of data contained in the column/attribute.
* The unique counts of dtypes (pandas data types).
* The memory usage of our data set.
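A sketch of calling `.info()` on a small hypothetical DataFrame that contains one missing value, so the non-null counts differ between columns:

```python
import pandas as pd

# Hypothetical data; the None in "age" becomes a missing value (NaN).
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [22, None, 26],
    "fare": [7.25, 71.28, 7.93],
})

# Prints the index type, per-column non-null counts and dtypes, and memory usage.
df.info()
```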

Types affect the way data is represented in machine learning models, whether we can apply math operators to them, etc.   

Some common problems with working with a new dataset:  
* Missing values.
* Unexpected types (string/object instead of int/float).
* Dirty data (commas, dollar signs, unexpected characters, etc.).
* Blank values that are actually "non-null" or single white-space characters.
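One possible way to spot and clean up the problems listed above, sketched on a hypothetical `price` column. The cleaning steps (stripping `$` and `,`, then converting with `pd.to_numeric`) are an illustrative choice, not the only approach:

```python
import pandas as pd

# Hypothetical dirty column: currency symbols, a whitespace-only cell, a true null.
df = pd.DataFrame({"price": ["$1,200", "$950", " ", None]})

# isnull() only counts true nulls; the " " cell is not counted as missing.
print(df["price"].isnull().sum())

# Remove dollar signs and commas, trim whitespace, then convert to numbers.
# errors="coerce" turns anything that still isn't numeric into NaN.
cleaned = df["price"].str.replace(r"[$,]", "", regex=True).str.strip()
prices = pd.to_numeric(cleaned, errors="coerce")
print(prices)
```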

#### Summarize the data with .describe()
It gives us the following statistics:

* Count, which is the number of non-null values in the column.
* Mean, or, the average of the values in the column.
* Std, which is the standard deviation.
* Min, a.k.a., the minimum value.
* 25%, or, the 25th percentile of the values.
* 50%, or, the 50th percentile of the values (which is equivalent to the median).
* 75%, or, the 75th percentile of the values.
* Max, which is the maximum value.  

Let's try this on a single column as well as the entire dataframe.
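A sketch of both calls on hypothetical numeric data:

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"age": [22, 38, 26, 35], "fare": [7.25, 71.28, 7.93, 53.10]})

age_summary = df["age"].describe()  # summary of a single column (a Series)
full_summary = df.describe()        # summary of every numeric column (a DataFrame)

print(age_summary)
print(full_summary)
```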

There are also built-in math functions that will work on all columns of a DataFrame at once, as well as subsets of the data.

#### For example, I can use the .mean() function on the titanic DataFrame to get the mean for every column.
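A sketch on hypothetical data rather than the actual Titanic file. Passing `numeric_only=True` is an addition here: it skips text columns, which recent pandas versions would otherwise refuse to average:

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# Mean of every numeric column at once.
column_means = df.mean(numeric_only=True)
print(column_means)

# Mean of a subset: fares for rows where age is above 25.
fare_over_25 = df.loc[df["age"] > 25, "fare"].mean()
print(fare_over_25)
```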

# Independent Practice Time

Now that we have some basics down, let's practice some basic DataFrame use on a new data set.

**Pro tip:** When your cursor is in a string, you can use the "tab" key to browse file system resources and get a relative reference for the files that can be loaded in Jupyter notebook. Remember, you have to use your arrow keys to navigate the files populated in the UI.

1. Find and load the diamonds data set into a DataFrame.
2. Print out the columns.
3. What does the data set look like in terms of dimensions?
4. Check the types of each column.  
    a. What is the most common type?   
    b. How many entries are there?   
    c. How much memory does this data set consume?
5. Examine the summary statistics of the data set.

## Pandas Indexing

#### Let's read in the drug dataset for practicing indexing

A common task is that we'll want to operate on a specific portion of our data. With indexing, we can pull out a specific part of our DataFrame.  

pandas has two main properties you can use for indexing:

* **.loc** indexes with the labels for rows and columns.
* **.iloc** indexes with the integer positions for rows and columns. 

#### To help clarify these differences, let's first reset the row labels to letters using the .set_index() function:
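A sketch of relabeling the rows with letters. The values below are made up and only the column names `age`, `marijuana-use`, and `marijuana-frequency` come from the text; `alcohol-use` is a hypothetical extra column. Wrapping the labels in `pd.Index` matters, because a plain list of strings would be read as column names:

```python
import pandas as pd

# Hypothetical stand-in for the drug data set (values are invented).
drug = pd.DataFrame({
    "age": ["12", "13", "14", "15", "16", "17", "18"],
    "alcohol-use": [3.9, 8.5, 18.1, 29.2, 40.1, 49.3, 58.7],
    "marijuana-use": [1.1, 3.4, 8.7, 14.5, 22.5, 28.0, 33.7],
    "marijuana-frequency": [4.0, 15.0, 24.0, 25.0, 30.0, 36.0, 52.0],
})

# Replace the default 0..6 row labels with the letters A..G.
drug = drug.set_index(pd.Index(list("ABCDEFG")))
print(drug.index)
```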

#### Using the .loc indexer, we can pull out rows B through F and the marijuana-use and marijuana-frequency columns.
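A sketch with invented values in place of the real drug data. Note that label slices with `.loc` include both endpoints, so `"B":"F"` returns five rows:

```python
import pandas as pd

# Hypothetical stand-in for the drug data set, already labeled A..G.
drug = pd.DataFrame(
    {
        "age": ["12", "13", "14", "15", "16", "17", "18"],
        "alcohol-use": [3.9, 8.5, 18.1, 29.2, 40.1, 49.3, 58.7],
        "marijuana-use": [1.1, 3.4, 8.7, 14.5, 22.5, 28.0, 33.7],
        "marijuana-frequency": [4.0, 15.0, 24.0, 25.0, 30.0, 36.0, 52.0],
    },
    index=list("ABCDEFG"),
)

# Rows by label (inclusive on both ends), columns by name.
subset_loc = drug.loc["B":"F", ["marijuana-use", "marijuana-frequency"]]
print(subset_loc)
```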

#### We can do the same thing with the .iloc indexer. This time we use integers for the location.
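The equivalent selection by position, again on invented values. Unlike `.loc`, integer slices with `.iloc` exclude the end position, so rows B through F are `1:6`. The column positions `2:4` assume the hypothetical column order used here:

```python
import pandas as pd

# Hypothetical stand-in for the drug data set, already labeled A..G.
drug = pd.DataFrame(
    {
        "age": ["12", "13", "14", "15", "16", "17", "18"],
        "alcohol-use": [3.9, 8.5, 18.1, 29.2, 40.1, 49.3, 58.7],
        "marijuana-use": [1.1, 3.4, 8.7, 14.5, 22.5, 28.0, 33.7],
        "marijuana-frequency": [4.0, 15.0, 24.0, 25.0, 30.0, 36.0, 52.0],
    },
    index=list("ABCDEFG"),
)

# Rows at positions 1..5 (B..F), columns at positions 2 and 3.
subset_iloc = drug.iloc[1:6, 2:4]
print(subset_iloc)
```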

While we created an index earlier, we can also use a column to set an index.
#### Let's use age to reset the index.
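A sketch using invented values. `set_index("age")` returns a new DataFrame in which `age` has moved out of the columns and become the row labels:

```python
import pandas as pd

# Hypothetical stand-in for the drug data set (values are invented).
drug = pd.DataFrame({
    "age": ["12", "13", "14"],
    "marijuana-use": [1.1, 3.4, 8.7],
    "marijuana-frequency": [4.0, 15.0, 24.0],
})

by_age = drug.set_index("age")  # the original `drug` is left unchanged
print(by_age)
```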

Age may not be the best index. 
#### We can use df.reset_index() to reset our index.
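A sketch of undoing an index, on invented values. `reset_index()` moves the current index back into a regular column and restores the default integer labels:

```python
import pandas as pd

# Hypothetical DataFrame that is currently indexed by age.
by_age = pd.DataFrame({
    "age": ["12", "13", "14"],
    "marijuana-use": [1.1, 3.4, 8.7],
}).set_index("age")

restored = by_age.reset_index()  # "age" becomes a column again
print(restored)
```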

### Creating DataFrames

You can create your own DataFrame without importing data from a file using pd.DataFrame() on a dictionary.  
Make sure the dictionary has lists of values that are all the same length. The keys correspond to the names of the columns, and the values correspond to the data in the columns.

```python
mydata = pd.DataFrame({'Letters':['A','B','C'], 'Integers':[1,2,3], 'Floats':[2.2, 3.3, 4.4]})
mydata
```

|   | Floats | Integers | Letters |
|---|---|---|---|
| 0 | 2.2 | 1 | A |
| 1 | 3.3 | 2 | B |
| 2 | 4.4 | 3 | C |


#### Examine the data types

Use .dtypes on your DataFrame.  

Strings are stored as a type called "object," as they are not guaranteed to take up a set amount of space (strings can be any length).
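A sketch that rebuilds the small `mydata` frame and inspects its types. The exact type names printed can vary with the pandas version and platform, so no expected output is shown:

```python
import pandas as pd

mydata = pd.DataFrame({
    "Letters": ["A", "B", "C"],
    "Integers": [1, 2, 3],
    "Floats": [2.2, 3.3, 4.4],
})

# .dtypes is a property that returns one dtype per column.
print(mydata.dtypes)
```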

#### Rename columns

Change the column name Integers to int:
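One way to do this is `DataFrame.rename()` with a dictionary mapping old names to new ones. It returns a new DataFrame, so the result is assigned back to `mydata` here:

```python
import pandas as pd

mydata = pd.DataFrame({
    "Letters": ["A", "B", "C"],
    "Integers": [1, 2, 3],
    "Floats": [2.2, 3.3, 4.4],
})

# Map old column name -> new column name; other columns are untouched.
mydata = mydata.rename(columns={"Integers": "int"})
print(mydata.columns)
```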

#### Rename all of the columns by assigning a list to the .columns property
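A sketch of renaming every column in one step. The new names `A`, `B`, and `C` are arbitrary placeholders; the list must have exactly one name per column, in the existing column order:

```python
import pandas as pd

mydata = pd.DataFrame({
    "Letters": ["A", "B", "C"],
    "Integers": [1, 2, 3],
    "Floats": [2.2, 3.3, 4.4],
})

# Assigning a list replaces all column labels at once.
mydata.columns = ["A", "B", "C"]
print(mydata.head())
```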