# Module 6 - Data Analysis with Pandas

Pandas is a Python library that facilitates data analysis. Pandas DataFrames are extremely useful for data analysis because they provide a user-friendly way to visualize data and run statistical tests on it. The interface offers intuitive combinations of commands that resemble SQL in relational databases.

Reference on Pandas: https://www.w3schools.com/python/pandas/default.asp

- Import `pandas` as `pd`, just as `numpy` is commonly imported under the alias `np`.

In [1]:
import pandas as pd

## Introduction to Series and DataFrame

### Series
- `Series` is a data structure provided by Pandas.

They are essentially 1D arrays. They can hold any type of data, and we can also create labels to index the values with.

In [2]:
# creating a series object
obj = pd.Series([4, 7, -5, 3])

- If not given a label, the default indexes used are numeric like in lists and arrays.

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

- So the series has data values and index:

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Pandas builds on NumPy and adds *semantic* information to the data.

- We can create custom index data to indicate the meaning (or _semantics_) of the values.
- The index can be specified during construction.

In [6]:
obj2 = pd.Series([4, 7, -5, 3], index=['jack', 'jill', 'joe', 'albert'])

In [7]:
obj2

jack      4
jill      7
joe      -5
albert    3
dtype: int64

- Accessing values

In [8]:
# Extract a value given its index label

obj2['jack']

4

In [9]:
obj2['albert']

3

In [10]:
# We can make use of generalized numpy indexing syntax to extract
# sub series

obj2[['jack', 'jill', 'joe']]

jack    4
jill    7
joe    -5
dtype: int64
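The bracket lookups above work by label. As a minimal sketch of the difference between label-based and position-based access, the example below uses the `.loc` and `.iloc` accessors on a series built like `obj2`; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['jack', 'jill', 'joe', 'albert'])

by_label = s.loc['joe']      # look up by index label
by_position = s.iloc[2]      # look up by integer position (third element)

# Label slices include both endpoints; position slices exclude the stop.
label_slice = s.loc['jill':'joe']
position_slice = s.iloc[1:2]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Being explicit with `.loc` and `.iloc` avoids ambiguity when the index itself contains integers.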

In [11]:
# Construct boolean indexes based on a Numpy boolean array construction

obj2 > 0

jack       True
jill       True
joe       False
albert     True
dtype: bool

In [12]:
# We can use the boolean series to index the series.
# (similar to numpy's indexing with boolean arrays)
obj2[obj2 > 0]

jack      4
jill      7
albert    3
dtype: int64
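Boolean series can also be combined before indexing. The sketch below assumes a series like `obj2` and uses the element-wise operators `&` (and) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['jack', 'jill', 'joe', 'albert'])

# Values strictly between 0 and 5
between = s[(s > 0) & (s < 5)]

# Values that are NOT positive
not_positive = s[~(s > 0)]

print(between)
print(not_positive)
```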

- We can construct a `Series` from a dictionary

In [13]:
dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
dict_data

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [14]:
dict_data.keys()

dict_keys(['Ohio', 'Texas', 'Oregon', 'Utah'])

In [15]:
obj3 = pd.Series(dict_data)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

**Note:**
- `dict_data` does not have a count for `California`.
- What if we insist that `California` is part of the index?

In [16]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(dict_data, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [17]:
# Check which index entries have missing values
# using pd.isnull(...)
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [18]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [19]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
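Once missing entries are detected, a common next step is to drop or replace them. The following is a small sketch using `dropna` and `fillna` on a series built the same way as `obj4`; replacing with `0` is only an illustrative choice, since the right fill value depends on the analysis.

```python
import pandas as pd

dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
s = pd.Series(dict_data, index=['California', 'Ohio', 'Oregon', 'Texas'])

n_missing = s.isnull().sum()   # count of missing entries
dropped = s.dropna()           # new series without the missing entries
filled = s.fillna(0)           # new series with NaN replaced by 0

print(n_missing)
print(dropped)
print(filled)
```

Both methods return new series and leave `s` unchanged.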

- We can perform operations between series, for example if we want to aggregate the values for each state:

In [20]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [21]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [22]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
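The sum aligns the two series on their index labels, so a label missing from either side yields `NaN`. As a hedged sketch of one alternative, `Series.add` accepts a `fill_value` that stands in for a value missing on one side only; where both sides are missing (here `California`, which is absent from the first series and `NaN` in the second) the result is still `NaN`.

```python
import pandas as pd

dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
a = pd.Series(dict_data)
b = pd.Series(dict_data, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Treat a value missing on exactly one side as 0 before adding
total = a.add(b, fill_value=0)

print(total)
```

With this call `Utah` keeps its value from `a` instead of becoming `NaN`.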

- Pandas also allows us to update indexes after they are created.

In [23]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [24]:
# assigning semantic indexes
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

- We can also add other semantic information (metadata) to the object itself:

In [25]:
# assigning a name to the Series object
obj.name = 'trading_income'
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
Name: trading_income, dtype: int64

In [26]:
# assigning a name to the indexes
obj.index.name = 'customer_names'
obj

customer_names
Bob      4
Steve    7
Jeff    -5
Ryan     3
Name: trading_income, dtype: int64
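The same metadata can be supplied when the series is constructed rather than assigned afterwards. The sketch below assumes the same values and names as above; `renamed` and the name `'income'` are hypothetical, used only to show that `rename` returns a new series.

```python
import pandas as pd

s = pd.Series(
    [4, 7, -5, 3],
    index=pd.Index(['Bob', 'Steve', 'Jeff', 'Ryan'], name='customer_names'),
    name='trading_income',
)

# rename with a scalar returns a new series; s keeps its original name
renamed = s.rename('income')

print(s.name, s.index.name)
print(renamed.name)
```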

### DataFrame 

`DataFrame` is a data structure provided by Pandas to store and manipulate tabular data.

- Construct a DataFrame object from a dictionary of values.

In [27]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2002, 2001, 2002],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
data

{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [28]:
frame = pd.DataFrame(data)

In [29]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [32]:
# Individual columns can be extracted
# as a Pandas series

frame['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [33]:
# Multiple columns can be extracted
# as a Pandas DataFrame
frame[['state', 'year']]

    state  year
0    Ohio  2000
1    Ohio  2001
2    Ohio  2002
3  Nevada  2001
4  Nevada  2002
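The type of the result depends on what goes inside the brackets: a single column name gives a `Series`, while a list of names gives a `DataFrame`, even if the list has one element. A short sketch, assuming the same data as `frame`:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

col = frame['state']     # single label: 1D Series
sub = frame[['state']]   # list of labels: 2D DataFrame with one column

print(type(col).__name__, col.shape)
print(type(sub).__name__, sub.shape)
```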


- Another way of constructing a DataFrame is from a collection of rows.

In [34]:
data = [['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9]]
data

[['Ohio', 2000, 1.5],
 ['Ohio', 2001, 1.7],
 ['Ohio', 2002, 3.6],
 ['Nevada', 2001, 2.4],
 ['Nevada', 2002, 2.9]]

In [35]:
frame2 = pd.DataFrame(data, columns=['state', 'year', 'pop'])
frame2

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
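Both constructions should describe the same table. As a quick check, the sketch below builds the frame column-wise and row-wise and compares the two with `DataFrame.equals`; the variable names are illustrative.

```python
import pandas as pd

columns_data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                'year': [2000, 2001, 2002, 2001, 2002],
                'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
rows = [['Ohio', 2000, 1.5],
        ['Ohio', 2001, 1.7],
        ['Ohio', 2002, 3.6],
        ['Nevada', 2001, 2.4],
        ['Nevada', 2002, 2.9]]

from_columns = pd.DataFrame(columns_data)
# Rows carry no column names, so they are passed explicitly
from_rows = pd.DataFrame(rows, columns=['state', 'year', 'pop'])

# equals compares values, labels, and column dtypes
same = from_columns.equals(from_rows)
print(same)
```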
