<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Hands on with Pandas

---

## NumPy Overview

* Why Python for data? NumPy exposes *decades* of optimized C and Fortran numerical code to Python.


* Adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. 


* The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. 


* In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. 

* NumPy wraps extensive C/C++/Fortran codebases that provide data analysis functionality.


* The `ndarray` type allows easy vectorized math and broadcasting (i.e., arithmetic between arrays of different but compatible shapes).
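
To make the last bullet concrete, here is a minimal sketch of vectorized math and broadcasting. It uses the conventional `import numpy as np` alias rather than the wildcard import used in this notebook, and the array names are made up for illustration.

```python
import numpy as np

# Vectorized math: the multiplication is applied to every element at once,
# with no explicit Python loop.
prices = np.array([10.0, 20.0, 30.0])
discounted = prices * 0.9

# Broadcasting: a (2, 3) array plus a (3,) array.
# The 1-D array is conceptually repeated across each row of the 2-D array.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
offsets = np.array([10, 20, 30])
shifted = grid + offsets

print(discounted)
print(shifted)
```

Broadcasting only works when the shapes are compatible; adding a `(2, 3)` array to a `(2,)` array, for example, raises an error.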


In [None]:
from numpy import *  # Import every public NumPy name into the current namespace


### Creating ndarrays
An array object represents a multidimensional, homogeneous array of fixed-size items.
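
As a small illustration of "homogeneous" and "fixed-size items", the sketch below (with made-up values, and using the `np` alias instead of the wildcard import) shows NumPy choosing one data type for the whole array.

```python
import numpy as np

# Mixing ints and a float: every element is stored with the same dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64
print(mixed.shape)   # (3,)

# A 2 x 3 array of 64-bit integers: each item occupies 8 bytes.
table = np.zeros((2, 3), dtype=np.int64)
print(table.ndim, table.shape, table.itemsize)
```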

In [None]:
# Creating arrays
a = zeros((3))
b = ones((2,3))
c = random.randint(1,10,(2,3,4))
d = arange(0,11,1)

What are these functions?

In [None]:
a

In [None]:
b

In [None]:
c

In [None]:
d

In [None]:
# reassign
a = array( [20,30,40,50] )
b = arange( 4 )
b

In [None]:
c = a-b
c

In [None]:
b**2

## Indexing, Slicing and Iterating

In [None]:
# one-dimensional arrays work like lists:
a = arange(10)**2
a

In [None]:
a[2:5]

In [None]:
# Multidimensional arrays take one index per axis, separated by commas,
# in (row, column) order; as always in Python, indexing starts at 0

b = random.randint(1,100,(4,4))
b

In [None]:
# Guess the output
print(b[2,3])
print(b[0,0])

In [None]:
b[0:3,1],b[:,1]

In [None]:
b[1:3,:]
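
Because the cells above use random values, the results change on every run. The sketch below repeats the same indexing patterns on a deterministic array (a hypothetical `grid` built with `np.arange`) so the outputs can be checked by hand.

```python
import numpy as np

# A 4 x 4 array holding 0..15, filled row by row.
grid = np.arange(16).reshape(4, 4)

single = grid[2, 3]    # row 2, column 3
column = grid[:, 1]    # every row, column 1
rows = grid[1:3, :]    # rows 1 and 2, all columns

print(single)
print(column)
print(rows)
```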

## Using Pandas

Pandas is frequently used in data science because it offers a large set of commonly used functions, is relatively fast, and has a large community. Because many data science libraries also use NumPy to manipulate data, you can easily transfer data between libraries (as we will often do in this class!).

Pandas is a large library that typically takes a lot of practice to learn. It heavily overloads Python operators, which can result in odd-looking syntax. For example, given a `DataFrame` called `cars` that contains a column `mpg`, we might want to view all cars with mpg over 35. To do this, we might write: `cars[cars['mpg'] > 35]`. The expression is valid Python syntax, but with built-in containers it would not work: if `cars` were a dictionary of lists, for instance, comparing the list `cars['mpg']` to `35` would raise a `TypeError`.
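
The filtering expression can be sketched on a tiny, made-up `cars` table (the models and mpg values below are invented for illustration):

```python
import pandas as pd

# Hypothetical data; only the column name `mpg` comes from the text above.
cars = pd.DataFrame({
    "model": ["A", "B", "C", "D"],
    "mpg": [22.0, 36.5, 41.0, 30.1],
})

# The comparison is evaluated element-wise and yields a Boolean Series...
mask = cars["mpg"] > 35

# ...which is then used to keep only the rows where the mask is True.
efficient = cars[mask]

print(mask)
print(efficient)
```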

Pandas also highly favors certain patterns of use. For example, looping through a `DataFrame` row by row is highly discouraged. Instead, Pandas favors using **vectorized functions** that operate column by column. (This is because each column is stored separately as an `ndarray`, and NumPy is optimized for operating on `ndarray`s.)
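
A rough sketch of the difference, on made-up data: both approaches below compute the same values, but the vectorized version is shorter and generally much faster on large tables. The conversion factor (about 0.425 km/L per US mpg) is only there to give the loop something to do.

```python
import pandas as pd

cars = pd.DataFrame({"mpg": [22.0, 36.5, 41.0]})  # hypothetical values
KM_PER_L_PER_MPG = 0.425144  # approximate conversion from US mpg to km/L

# Discouraged: visiting the DataFrame one row at a time.
km_per_l_loop = []
for _, row in cars.iterrows():
    km_per_l_loop.append(row["mpg"] * KM_PER_L_PER_MPG)

# Preferred: one vectorized operation on the whole column.
cars["km_per_l"] = cars["mpg"] * KM_PER_L_PER_MPG

print(cars)
```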

Do not be discouraged if Pandas feels overwhelming. Gradually, as you use it, you will become familiar with which methods to use and the "Pandas way" of thinking about and manipulating data.

In [None]:
# Load Pandas into Python
import pandas as pd

<a id="reading-files"></a>
### Reading Files, Selecting Columns, and Summarizing

In [None]:
users = pd.read_table('data/user.tbl', sep='|')
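
If the data file is not at hand, the parsing step can be tried on an in-memory sample. The sketch below assumes a pipe-delimited layout with a header row; the rows are made up and the real file's columns may differ. It uses `pd.read_csv` with `sep="|"`, which parses delimited text the same way.

```python
import io
import pandas as pd

# Made-up rows in a pipe-delimited layout (an assumption about the format).
raw = io.StringIO(
    "user_id|age|gender|occupation\n"
    "1|24|M|technician\n"
    "2|53|F|other\n"
    "3|17|M|student\n"
)

sample_users = pd.read_csv(raw, sep="|")

print(sample_users.shape)
print(sample_users.dtypes)
```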

**Examine the users data.**

In [None]:
type(users)             # check its type

In [None]:
users                   # Display the DataFrame (long output is truncated in the middle).

In [None]:
users.head()            # Print the first five rows.

In [None]:
users.head(10)          # Print the first 10 rows.

In [None]:
users.tail()            # Print the last five rows.

In [None]:
# The row index (the row labels; in this case, integers)
users.index            

In [None]:
# Column names (also stored as an Index object)
users.columns

In [None]:
# Data type of each column (each column is stored as an
# ndarray, which has a single data type)
users.dtypes

In [None]:
# Number of rows and columns
users.shape

In [None]:
# All values as a NumPy array
users.values

In [None]:
# Concise summary (including memory usage) — 
# useful to quickly see if nulls exist
users.info()

**Select or index data.**<br>
Pandas `DataFrame`s have structural similarities with Python-style lists and dictionaries.  
In the example below, we select a column of data using the name of the column in a similar manner to how we select a dictionary value with the dictionary key.

In [None]:
# Select a column
users['occupation']

In [None]:
type(users['occupation'])

In [None]:
# Select one column using the DataFrame attribute.
users.occupation

# While a useful shorthand, attribute access only works if the
# column name is a valid Python identifier (no spaces or punctuation)
# and does not clash with an existing DataFrame attribute or method.
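
The limits of attribute access can be seen on a small, made-up DataFrame (the column names below are chosen only to trigger each case):

```python
import pandas as pd

people = pd.DataFrame({
    "occupation": ["doctor", "lawyer"],
    "zip code": ["10001", "94105"],   # name contains a space
    "count": [1, 2],                  # name clashes with DataFrame.count()
})

# Attribute access works for a simple column name.
same = people.occupation.equals(people["occupation"])

# A name with a space can only be reached with brackets.
zips = people["zip code"]

# `people.count` is the DataFrame method, not the column.
is_method = callable(people.count)
count_column = people["count"]

print(same, is_method)
```

Bracket notation works in every case, which is why it is the safer default.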

**Summarize (describe) the data.**<br>
Pandas has several built-in methods that summarize your data and give you a quick overall picture of it.

In [None]:
# Describe all numeric columns.
users.describe()

In [None]:
# Describe all columns, including non-numeric.
users.describe(include='all')

In [None]:
# Describe a single column — recall that "users.occupation" 
# refers to a Series.
users["occupation"].describe()

In [None]:
# Calculate the mean of the ages.
users["age"].mean()

**Count the number of occurrences of each value.**

In [None]:
users["occupation"].value_counts()     # Most useful for categorical variables

In [None]:
# Can also be used with numeric variables.
# Try .sort_index() to sort by age or .sort_values()
# to sort by count.
users["age"].value_counts()

In [None]:
# You can also do it the "long way"
users.groupby("occupation")["user_id"].count()
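
The two approaches give the same counts but order them differently. A sketch on made-up data (assuming, as above, a `user_id` column with no missing values):

```python
import pandas as pd

people = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "occupation": ["student", "doctor", "student", "lawyer", "student"],
})

# value_counts: ordered by count, most frequent first.
counts = people["occupation"].value_counts()

# groupby + count: ordered by the group label (alphabetical here).
grouped = people.groupby("occupation")["user_id"].count()

print(counts)
print(grouped)
```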

<a id="filtering-and-sorting"></a>
### Filtering and Sorting
- **Objective:** Filter and sort data using Pandas.

We can use comparison operators on columns to keep relevant rows or drop irrelevant ones.

**Logical filtering: Only show users with age < 20.**

In [None]:
# Create a Series of Booleans…
# In Pandas, this comparison is performed element-wise 
# on each row of data.
young_bool = users["age"] < 20
young_bool

In [None]:
# …and use that Series to filter rows.
# In Pandas, indexing a DataFrame by a Series of Booleans 
# only selects rows that are True in the Boolean.
users[young_bool]

In [None]:
# Or, combine into a single step.
users[users["age"] < 20]

In [None]:
# Important: Boolean indexing returns a new DataFrame (a copy of the
# matching rows), not a view of the original.
# If you store the result in a variable and alter it, only that copy
# changes; the original `users` DataFrame is left untouched.
# Because pandas cannot tell whether you meant to modify the original,
# it may emit a SettingWithCopyWarning on the assignment below.

# To modify the original DataFrame, it is best practice to use
# .loc or .iloc on it directly instead of the syntax below.

users_under20 = users[users["age"] < 20]
# To avoid the warning, make the copy explicit with `.copy()`:
#     users_under20 = users[users["age"] < 20].copy()
users_under20['is_under_20'] = True

In [None]:
users.head()

In [None]:
users_under20.head()
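
A minimal sketch of the `.copy()` approach mentioned in the comments above, using a small made-up DataFrame in place of `users`:

```python
import pandas as pd

people = pd.DataFrame({"user_id": [1, 2, 3], "age": [17, 34, 19]})

# An explicit copy makes it clear that we intend to work on a
# separate DataFrame, so adding a column does not trigger the warning.
under20 = people[people["age"] < 20].copy()
under20["is_under_20"] = True

print(under20)
print(people.columns.tolist())   # the original has no new column
```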

To create the `is_under_20` column in the original `DataFrame`, we can use `.loc`.

The syntax is:

`my_dataframe.loc[<filter_condition>, <column>] = <new_value>`

In [None]:
users.loc[users["age"] < 20, "is_under_20"] = True
users.head()

In [None]:
users.loc[users["age"] >= 20, "is_under_20"] = False
users.head()
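
When the new column is simply the result of a comparison, one alternative (not the approach used in this notebook) is to assign the Boolean Series directly. This fills every row in a single step and gives the column a proper Boolean data type. The sketch uses a made-up DataFrame:

```python
import pandas as pd

people = pd.DataFrame({"user_id": [1, 2, 3], "age": [17, 34, 19]})

# The comparison already yields True/False for every row.
people["is_under_20"] = people["age"] < 20

print(people)
print(people["is_under_20"].dtype)
```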

`.loc` is also useful if you want to filter **both** rows and columns at the same time.

In [None]:
# Select one column from the filtered results.
users.loc[users["is_under_20"], "occupation"]

**Logical filtering with multiple conditions**

In [None]:
# Ampersand for `AND` condition. (This is a "bitwise" `AND`.)
# Important: You MUST put parentheses around each comparison
# because `&` has a higher precedence than comparison operators
# such as `<` and `==`.

users[(users["is_under_20"]) & (users["gender"] == 'M')]

In [None]:
# Pipe for `OR` condition. (This is a "bitwise" `OR`.)
# Important: You MUST put parentheses around each comparison
# because `|` has a higher precedence than comparison operators
# such as `<` and `>`.

users[(users["is_under_20"]) | (users["age"] > 60)]
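
To see why the parentheses matter, the sketch below (on made-up data) runs the same `OR` filter with and without them. Without parentheses, Python groups `20 | people["age"]` first and then treats the rest as a chained comparison, which in this case raises a `ValueError`.

```python
import pandas as pd

people = pd.DataFrame({"age": [17, 34, 65]})

# Correct: each comparison is wrapped in parentheses.
correct = people[(people["age"] < 20) | (people["age"] > 60)]

# Incorrect: parsed as  people["age"] < (20 | people["age"]) > 60,
# a chained comparison that needs the truth value of a whole Series.
try:
    people[people["age"] < 20 | people["age"] > 60]
    failed = False
except ValueError:
    failed = True

print(correct)
print(failed)
```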

In [None]:
# Preferred alternative to multiple `OR` conditions
users[users["occupation"].isin(['doctor', 'lawyer'])]

**Sorting**

In [None]:
# Sort a Series.
users["age"].sort_values()

In [None]:
# Sort a DataFrame by a single column.
users.sort_values('age')

In [None]:
# Use descending order instead.
users.sort_values('age', ascending=False)

In [None]:
# Sort by multiple columns.
users.sort_values(['occupation', 'age'])
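
When sorting by several columns, `ascending` also accepts a list with one entry per column. A small sketch on made-up data, sorting occupations alphabetically and ages from oldest to youngest within each occupation:

```python
import pandas as pd

people = pd.DataFrame({
    "occupation": ["student", "doctor", "student", "doctor"],
    "age": [19, 45, 17, 38],
})

# One ascending flag per sort column.
ordered = people.sort_values(["occupation", "age"], ascending=[True, False])

print(ordered)
```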

<a id="columns"></a>
### Renaming, Adding, and Removing Columns

- **Objective:** Manipulate `DataFrame` columns.

In [None]:
# Read drinks.csv into a DataFrame called "drinks".
drinks = pd.read_csv('data/drinks.csv')

In [None]:
drinks.head()

In [None]:
# Rename one or more columns using a mapping of old names to new names.
# This returns a new DataFrame and leaves `drinks` unchanged.
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})

In [None]:
# Rename one or more columns in the original DataFrame.
drinks.rename(columns={'beer_servings':'beer', 
                       'wine_servings':'wine'}, inplace=True)

drinks.head()

In [None]:
# Replace all column names using a list of matching length
drink_cols = ['country', 'beer', 'spirit', 'wine', 'litres', 'continent'] 

# Replace during file reading (header=0 marks the file's first row as a
# header to discard, so the names in `names` are used instead)
drinks_renamed = pd.read_csv('data/drinks.csv', header=0, 
                             names=drink_cols)
drinks_renamed.head()

In [None]:
# Replace after file has already been read into Python.
drinks.columns = drink_cols

drinks.head()

**Easy Column Operations**<br>
Pandas applies arithmetic between columns element-wise, row by row, so there is no need to reference indexes or write `for` loops to do column-wise operations. When we add columns together, the values in each row are added together.

In [None]:
# Add a new column as a function of existing columns.
drinks['servings'] = drinks["beer"] + drinks["spirit"] + drinks["wine"]
drinks['mL'] = drinks["litres"] * 1000

drinks.head()
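
The same idea on a tiny made-up table, so the row-by-row arithmetic can be checked by hand. The second calculation, a row-wise `sum` over selected columns, is an alternative way to get the same totals rather than the method used in this notebook.

```python
import pandas as pd

sample_drinks = pd.DataFrame({
    "country": ["A", "B"],   # hypothetical rows
    "beer": [100, 20],
    "spirit": [50, 5],
    "wine": [30, 0],
})

# Element-wise addition: row i of the result sums row i of each column.
sample_drinks["servings"] = (
    sample_drinks["beer"] + sample_drinks["spirit"] + sample_drinks["wine"]
)

# Equivalent: sum across the selected columns within each row.
row_totals = sample_drinks[["beer", "spirit", "wine"]].sum(axis=1)

print(sample_drinks)
```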

**Removing Columns**

In [None]:
# drop returns a new DataFrame; axis=0 drops rows, axis=1 drops columns
drinks.drop('mL', axis=1)

In [None]:
drinks.head()

In [None]:
# Drop on the original DataFrame rather than returning a new one.
drinks.drop('mL', axis=1, inplace=True)

drinks.head()

In [None]:
# Drop multiple columns.
drinks.drop(['beer', 'servings'], axis=1)
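
As an alternative to `axis=1`, `drop` also accepts a `columns=` keyword, which some readers find easier to read. A minimal sketch on a made-up table:

```python
import pandas as pd

sample_drinks = pd.DataFrame({
    "country": ["A", "B"],   # hypothetical rows
    "beer": [100, 20],
    "wine": [30, 0],
    "servings": [130, 20],
})

# Returns a new DataFrame; `sample_drinks` itself is not modified.
trimmed = sample_drinks.drop(columns=["beer", "servings"])

print(trimmed.columns.tolist())
print(sample_drinks.columns.tolist())
```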