# Lesson 06 - Pandas DataFrames

### The following topics are discussed in this notebook:
* Reading data from a file.
* Selecting data from a DataFrame.
* Boolean masking. 
* Creating DataFrames.

### Additional Resources
* [Python Data Science Handbook, Ch 3](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html)
* [DataCamp: Intermediate Python for Data Science, Ch 2](https://www.datacamp.com/courses/intermediate-python-for-data-science)





## Pandas DataFrames

Pandas is a Python package developed for data manipulation and data analysis. The core feature of Pandas is the **DataFrame** data structure. A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table.

In [1]:
import numpy as np
import pandas as pd

## Reading Data From a File

We will often create DataFrames by reading data in from a file. 

In [None]:
states = pd.read_csv('state_data.csv')
states.head()
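
The file `state_data.csv` is not reproduced in this notebook, so the sketch below reads a small made-up CSV from an in-memory string instead. The column names mirror some of those used later in the lesson, but the rows are placeholders and are not the real data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; the values are illustrative only.
csv_text = """Abbv,State,Pop,Unemp
AA,Alpha,1000,4.5
BB,Beta,2500,6.1
CC,Gamma,400,3.2
"""

# read_csv accepts a file path or any file-like object.
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)    # (number of rows, number of columns)
print(demo.dtypes)   # column types inferred by pandas
```

The first line of the text is treated as the header row, which is the default behavior of `read_csv`.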

## Index of a DataFrame

By default, rows in a DataFrame are indexed numerically. However, we can assign one of the columns in the DataFrame to serve as an **index**. We can then access rows either by their numerical position or by their index value. 

In [None]:
states.set_index('Abbv', inplace=True)
states.head()

In [None]:
states.index.name = None
states.head()
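
As a minimal sketch of what `set_index` does, the example below uses a tiny made-up DataFrame (not the contents of `state_data.csv`). It omits `inplace=True`, in which case `set_index` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data; not taken from state_data.csv.
demo = pd.DataFrame({
    'Abbv': ['AA', 'BB', 'CC'],
    'Pop': [1000, 2500, 400],
})

# Without inplace=True, set_index returns a new DataFrame.
indexed = demo.set_index('Abbv')

print(demo.index.tolist())     # default numeric index: [0, 1, 2]
print(indexed.index.tolist())  # labels from the 'Abbv' column: ['AA', 'BB', 'CC']
```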

## Selecting Elements of a DataFrame

There are two indexable attributes that can be used to access elements of a DataFrame: `loc` and `iloc`. 

* `loc` is used to access elements of the DataFrame using column and row names. 
* `iloc` is used to access elements of the DataFrame using numerical indices for the rows and columns.  

In [None]:
# Population of Missouri
print(states.loc['MO','Pop'])
print(states.iloc[24,2])

In [None]:
# All Missouri Information
print(states.loc['MO',:])
print()
print(states.iloc[24,:])

In [None]:
# Unemployment for first four states
print(states.loc[:'AR','Unemp'])
print()
print(states.iloc[:4,3])
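
One detail worth noting in the slices above: a label slice with `loc` includes its endpoint, while a positional slice with `iloc` excludes it, as with ordinary Python slicing. The sketch below illustrates this on a small made-up DataFrame; the labels and values are placeholders.

```python
import pandas as pd

# Hypothetical toy data with string row labels.
demo = pd.DataFrame(
    {'Pop': [1000, 2500, 400, 800], 'Unemp': [4.5, 6.1, 3.2, 5.0]},
    index=['AA', 'BB', 'CC', 'DD'],
)

by_label = demo.loc[:'CC', 'Unemp']   # includes the row labeled 'CC'
by_position = demo.iloc[:3, 1]        # positions 0, 1, 2; position 3 is excluded

print(by_label)
print(by_position)
```

Both selections return the same three rows, even though the `loc` slice names the last row it wants and the `iloc` slice names the first row it does not want.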

In [None]:
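# `.ix` was removed in pandas 1.0; to mix positions and labels, look up the row labels first.
print(states.loc[states.index[:4], 'Unemp'])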

## Alternate Method of Accessing Columns

We can access a column of a DataFrame using the following syntax: `my_dataframe.loc[:,'ColName']`. Fortunately, there is a more concise way of accessing this information.

In [None]:
print(states.Pop)
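
Attribute access is convenient, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. Bracket access, `my_dataframe['ColName']`, always works. The sketch below uses a made-up DataFrame whose column names were chosen to show both cases.

```python
import pandas as pd

# Hypothetical toy data; 'count' and 'HS Grad' are deliberately awkward names.
demo = pd.DataFrame({'Pop': [1000, 2500], 'count': [3, 7], 'HS Grad': [88.0, 91.5]})

# For an ordinary name, both forms return the same Series.
same = demo.Pop.equals(demo['Pop'])
print(same)

# 'count' is also the name of a DataFrame method, so demo.count is the method.
print(callable(demo.count))
print(demo['count'].tolist())

# Names containing spaces can only be reached with brackets.
print(demo['HS Grad'].tolist())
```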

## Boolean Masking

We can use boolean masking along with `loc` to subset DataFrames.

In [None]:
sel = states.Unemp > 5
states.loc[sel,:]

In [None]:
states.loc[states.Area < 10000,:]
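
Masks can also be combined. The sketch below, on made-up data, uses `&` (and), `|` (or), and `~` (not) to build compound conditions. Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
demo = pd.DataFrame(
    {'Unemp': [4.5, 6.1, 3.2, 4.0], 'Area': [9000, 52000, 7000, 120000]},
    index=['AA', 'BB', 'CC', 'DD'],
)

high_unemp = demo.Unemp > 5
small = demo.Area < 10000

both = demo.loc[(demo.Unemp > 5) & (demo.Area > 10000), :]  # both conditions hold
either = demo.loc[high_unemp | small, :]                    # at least one holds
neither = demo.loc[~(high_unemp | small), :]                # none holds

print(both.index.tolist())
print(either.index.tolist())
print(neither.index.tolist())
```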

## Sorting by Columns

We can use the `sort_values()` method to sort the contents of a DataFrame.

In [None]:
states.sort_values('HS_Grad').head()

In [None]:
states.sort_values('HS_Grad', ascending=False).head()
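
`sort_values()` also accepts a list of column names, with a matching list for `ascending`. The sketch below sorts a made-up DataFrame by a hypothetical `Region` column first and then by `HS_Grad` in descending order within each region. Note that `sort_values()` returns a new DataFrame; the original is left in its existing order.

```python
import pandas as pd

# Hypothetical toy data; 'Region' is not a column of state_data.csv.
demo = pd.DataFrame(
    {'Region': ['W', 'E', 'W', 'E'], 'HS_Grad': [88.0, 91.5, 90.2, 85.1]},
    index=['AA', 'BB', 'CC', 'DD'],
)

# Region ascending, then HS_Grad descending within each region.
by_two = demo.sort_values(['Region', 'HS_Grad'], ascending=[True, False])

print(by_two)
print(demo.index.tolist())  # the original DataFrame keeps its order
```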

## Adding Columns to a DataFrame

In [None]:
states['PopDensity'] = states.Pop / states.Area
states.head()

In [None]:
states.sort_values('PopDensity', ascending=False).head(n=10)

## Creating DataFrames

We will occasionally need to create a DataFrame from a set of lists or arrays. Before discussing how to do this, we need to introduce the `dict` data type. 

A **`dict`** is a data type that is similar to a list, except that elements are referenced by a name (a **key**) assigned to them, rather than by a numerical index. Entries in a `dict` are defined as **key/value** pairs. 


In [None]:
sales_person = {
    'Name': 'Alice Smith',
    'Salary': 42000,
    'Clients': ['Stark Ent.', 'Wayne Ent.', 'Oscorp'],
    'SalesInvoices': [1204, 1250, 1321, 1347, 1598]
}

print(sales_person)

In [None]:
print(sales_person['Name'])
print(sales_person['Salary'])
print(sales_person['Clients'])
print(sales_person['SalesInvoices'])

We can use a `dict` to try to emulate the functionality of a DataFrame. 

In [None]:
abbreviation = ['AK', 'CO', 'IL', 'MO', 'NY']
state_name = ['Alaska', 'Colorado', 'Illinois', 'Missouri', 'New York']
population = [735132, 5268367, 12882135, 6044171, 19651127]
unemployment = [7.2, 2.4, 5.0, 4.0, 4.8]

states_dict = {'Abbv':abbreviation, 'State':state_name, 'Pop':population, 'UnEmp':unemployment}

We can look up information relating to Missouri as follows:

In [None]:
print(states_dict['Abbv'][3])
print(states_dict['State'][3])
print(states_dict['Pop'][3])
print(states_dict['UnEmp'][3])

Using a `dict` to store this type of data has some severe limitations:

* There is no convenient way of accessing an entire "row" at once. 
* We have to already know the numerical index of any "row" whose information we wish to access. 
* There is no convenient way to sort our data when it is stored in a dict.

Fortunately, it is easy to create a DataFrame from a dict.

In [None]:
states_df = pd.DataFrame(states_dict)
states_df
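
To connect this back to the limitations listed above, the sketch below rebuilds a DataFrame from the same abbreviation, population, and unemployment lists, sets the abbreviations as the index, and then retrieves a whole row by label and sorts the rows. The variable names `df`, `mo_row`, and `by_unemp` are introduced here for illustration only.

```python
import pandas as pd

# Same values as the lists defined earlier in this section.
states_dict = {
    'Abbv': ['AK', 'CO', 'IL', 'MO', 'NY'],
    'Pop': [735132, 5268367, 12882135, 6044171, 19651127],
    'UnEmp': [7.2, 2.4, 5.0, 4.0, 4.8],
}
df = pd.DataFrame(states_dict).set_index('Abbv')

mo_row = df.loc['MO', :]            # an entire row, looked up by label
by_unemp = df.sort_values('UnEmp')  # rows stay aligned across all columns

print(mo_row)
print(by_unemp.index.tolist())
```

With a plain `dict` of lists, each of these steps would require tracking positions by hand across every list.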

In [None]:
x1 = np.random.normal(10,2,20)
x2 = np.random.normal(20,5,20)
y = np.random.choice(['A','B'], 20)

data = pd.DataFrame({'x1':x1, 'x2':x2, 'y':y})
data.head()