# Pandas (Python Data Analysis Library)

## Series
A *Series* is a one-dimensional datatype similar to a numpy array: every element shares one type, but a Series takes a slightly different approach to manipulating data. Instantiating one looks much like creating an array - `series = Series([val_1, val_2, val_3, ...])`

 - Series have an index attribute, which holds the row label for each value
 - Manipulating a Series won't change which label is attached to which value

In [1]:
import numpy as np # needed by later cells that generate random data
import pandas as pd # like numpy's np, you'll usually see pandas imported as pd
from pandas import Series, DataFrame

s1 = Series([-2, 1, 0, -1, 2, -1])
print(s1) # elements on the left are the index, also notice it incorporates a dtype (Numpy)

0   -2
1    1
2    0
3   -1
4    2
5   -1
dtype: int64


Similar to a dictionary, *Series* have **indexes** and **values**. If an index isn't provided, it defaults to each element's position.

In [2]:
# Similar to a dictionary it has index, values
print("\n**************Index/Values:")
print(s1.index)
print(s1.values)


**************Index/Values:
RangeIndex(start=0, stop=6, step=1)
[-2  1  0 -1  2 -1]


We can, however, provide our own **indexes**

In [3]:
print("\n**************Defining Indexes:")
s2 = Series([-2, 1, 0, -1, 2, -1], index=['a', 'b', 'c', 'd', 'e', 'f'])
print(s2)


**************Defining Indexes:
a   -2
b    1
c    0
d   -1
e    2
f   -1
dtype: int64


These **indexes** stay attached to their values, regardless of the operation. Sorting reorders the rows, but each label travels with its value:

In [4]:
s2.sort_values()

a   -2
d   -1
f   -1
c    0
b    1
e    2
dtype: int64

### Indexes Cont.
 - Indexes don't have to be numeric nor do they have to be unique
 - Indexes control a lot of the logic on how Series interact with each other

In [5]:
s2 + 1

a   -1
b    2
c    1
d    0
e    3
f    0
dtype: int64

In [None]:
s3 = Series([-2, 1, 0, -1, 2, -1], index=['a', 'a', 'c', 'd', 'e', 'e'])
s3
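As a minimal sketch of what duplicate labels mean in practice: selecting a repeated label returns every matching value as a Series, while a unique label returns a single scalar. The data mirrors `s3` from the cell above.

```python
import pandas as pd

# Same data as s3: labels 'a' and 'e' each appear twice
s3 = pd.Series([-2, 1, 0, -1, 2, -1], index=['a', 'a', 'c', 'd', 'e', 'e'])

print(s3.index.is_unique)  # False, because of the repeated labels
print(s3['a'])             # a Series holding both values labelled 'a'
print(s3['c'])             # a single scalar, since 'c' is unique
```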

For the most part, *Series* effectively work like a cross between a numpy array and a *dict*, and they can be built directly from a dictionary:

In [9]:
d = {'a': 23, 'f':9, 'b': -21, 'e': 3.0}
s1 = Series(d)
s1

a    23.0
f     9.0
b   -21.0
e     3.0
dtype: float64

## Index/Value Existence
The `in` operator only checks the **index**. To find out whether a **value** exists, test against `series.values` instead.

In [10]:
print(s1)
print('a' in s1) # Only works for index
print(23 in s1.values)

a    23.0
f     9.0
b   -21.0
e     3.0
dtype: float64
True
True


### Adding *Series* together:
We can add series together. Values are matched by index label, and a label that doesn't exist in both series produces `NaN`.

In [11]:
print(s1)
print()
print(s2)
print()
print(s1 + s2)

a    23.0
f     9.0
b   -21.0
e     3.0
dtype: float64

a   -2
b    1
c    0
d   -1
e    2
f   -1
dtype: int64

a    21.0
b   -20.0
c     NaN
d     NaN
e     5.0
f     8.0
dtype: float64
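If the `NaN` results are not wanted, one option is the method form of addition, `Series.add`, with its `fill_value` parameter. The sketch below rebuilds `s1` and `s2` from the cells above and assumes a missing label should count as 0; that choice is an assumption for illustration, not a rule.

```python
import pandas as pd

s1 = pd.Series({'a': 23, 'f': 9, 'b': -21, 'e': 3.0})
s2 = pd.Series([-2, 1, 0, -1, 2, -1], index=['a', 'b', 'c', 'd', 'e', 'f'])

# fill_value stands in for a label that is missing from one side
# before the addition happens, so 'c' and 'd' no longer become NaN
total = s1.add(s2, fill_value=0)
print(total)
```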


## DataFrames
A dataframe is a tabular representation of data, comprised of a number of *Series* that share one index. This means that the columns can represent different types of data, but each column must be of one type. This makes a dataframe align with CSVs and relational databases very easily, and also means that operations across columns can be rather seamless.

### Instantiating DataFrames
There are a number of ways we can create a *DataFrame*:
 - *Dictionaries* of *Lists*
 - CSVs
 - SQL queries

In [12]:
# Can construct a dataframe from a dictionary of lists
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Montana', 'Montana', 'Montana'],
       'year': [2000, 2001, 2002, 2001, 2002, 2000, 2001, 2002],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 0.8, 0.75, 0.9]}

df = DataFrame(data)
df # Jupyter has nice formatting for dataframes

     state  year   pop
0     Ohio  2000   1.5
1     Ohio  2001   1.7
2     Ohio  2002   3.6
3   Nevada  2001   2.4
4   Nevada  2002   2.9
5  Montana  2000   0.8
6  Montana  2001  0.75
7  Montana  2002   0.9


We can alter the resulting *DataFrame* depending on how we instantiate it:

In [13]:
print("\n**************Specifying column order:")
df2 = DataFrame(data, columns = ['year', 'state', 'pop', 'ext']) # Sets the column order; 'ext' is not a key in data
df2


**************Specifying column order:


   year    state   pop  ext
0  2000     Ohio   1.5  NaN
1  2001     Ohio   1.7  NaN
2  2002     Ohio   3.6  NaN
3  2001   Nevada   2.4  NaN
4  2002   Nevada   2.9  NaN
5  2000  Montana   0.8  NaN
6  2001  Montana  0.75  NaN
7  2002  Montana   0.9  NaN


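### What happened with the **ext** column?
`ext` is not a key in `data`, so pandas still creates the column but fills it with `NaN` (missing values).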

Loading *DataFrames* from csv files is even easier:
`df = pd.read_csv(rel_file_path)`

*Note: There are other optional parameters for __read_csv__*
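As a rough illustration of two of those optional parameters, the sketch below reads CSV text from memory instead of a file. The CSV content and its column names (`id`, `color`, `score`) are made up for the example; `index_col` and `usecols` are real `read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "id,color,score\n1,Pink,0.97\n2,Blue,0.40\n3,Red,0.82\n"

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(
    io.StringIO(csv_text),
    index_col='id',                    # use the 'id' column as the index
    usecols=['id', 'color', 'score'],  # load only these columns
)
print(df_csv)
```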

In [14]:
df = pd.read_csv('sample_csv.csv')
df.head()

    X0        X1         X3   X4    X5
0   58  0.974626  -2.496562  -60  Pink
1   94   0.08573  -2.568997   30  Pink
2   23   0.40162  -0.678929  -47  Blue
3  100  0.816063  -5.406521  -26   Red
4    1  0.052036    0.93342  -46  Pink


## Dataframe Indexing
Indexing in DataFrames is very different from indexing lists. While it is most similar to dictionary indexing, it has its own nuances.
 - We can get/set column names with `df.columns`
 - We select a column with `df['col_name']` and rows by index label with `df.loc['index_val']`
 - We can select using ints if we use iloc `df.iloc[row, col]`

In [15]:
def get_data():
    """Generates a default dataframe
    Returns:
        DataFrame: Default dataframe for lecture
    """
    
    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2002, 2001, 2002],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

    df = DataFrame(data, index = ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'])
    return df

In [16]:
df = get_data()

**Column Names:**

In [17]:
df.columns

Index(['state', 'year', 'pop'], dtype='object')

**Column:**

In [18]:
df['year']

Ohio      2000
Ohio      2001
Ohio      2002
Nevada    2001
Nevada    2002
Name: year, dtype: int64

**Indexes:**

In [19]:
df.index

Index(['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], dtype='object')

**Specific Index:**

In [20]:
df.loc['Ohio'] # Note - This returns a subset of the original df

       state  year  pop
Ohio    Ohio  2000  1.5
Ohio    Ohio  2001  1.7
Ohio    Ohio  2002  3.6
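`loc` also accepts a column label after the row label. The sketch below rebuilds the lecture DataFrame inline so it runs on its own; note that because the index labels repeat, a single row label can still return several rows.

```python
import pandas as pd

df = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
     'year': [2000, 2001, 2002, 2001, 2002],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9]},
    index=['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
)

# Row label first, then column label
nevada_pop = df.loc['Nevada', 'pop']
print(nevada_pop)  # a Series: the label 'Nevada' matches two rows

# A list of column labels keeps the result as a DataFrame
ohio_subset = df.loc['Ohio', ['year', 'pop']]
print(ohio_subset)
```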


**Accessing by Row, Col Index:**
 - `df.iloc[ix]` - Get row at ix
 - `df.iloc[ix, iy]` - Get element at row ix, col iy
 - `df.iloc[::ix]` - Get every ixth row

In [21]:
print(df.iloc[3])

state    Nevada
year       2001
pop         2.4
Name: Nevada, dtype: object


In [22]:
print(df.iloc[3, 1])

2001


In [23]:
print(df.iloc[::2])

         state  year  pop
Ohio      Ohio  2000  1.5
Ohio      Ohio  2002  3.6
Nevada  Nevada  2002  2.9


## DataFrame Subsetting
Like Numpy, there are many situations where we will want to be able to subset our *DataFrames*. Luckily, comparison operations on a *DataFrame* column result in a **boolean** *Series*. This means we can index into/subset our *DataFrames* by doing comparisons.

Example:
`subset_df = df[df[col] > val]`

In [24]:
df['year'] == 2002

Ohio      False
Ohio      False
Ohio       True
Nevada    False
Nevada     True
Name: year, dtype: bool

In [25]:
df[df['year'] == 2002]

         state  year  pop
Ohio      Ohio  2002  3.6
Nevada  Nevada  2002  2.9


## In class work - Problem 1:
Subset the dataframe to find which states had a population < 2 in 2001 or 2002.

*Note: When doing multiple comparisons you need to wrap each one in __()__: `df[(cond1) & (cond2)]` or `df[(cond1) | (cond2)]`*

In [57]:
df = get_data()
df[(df['year'] > 2000) & (df['pop'] < 2)]
# Space for work


       state  year  pop
Ohio    Ohio  2001  1.7
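The solution above relies on 2001 and 2002 being the only years after 2000 in this data. A sketch of a more explicit alternative uses `Series.isin` to spell out "2001 or 2002"; the DataFrame is rebuilt inline so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
     'year': [2000, 2001, 2002, 2001, 2002],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9]},
    index=['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
)

# isin() checks each year against a list of allowed values
in_years = df['year'].isin([2001, 2002])
low_pop = df['pop'] < 2

# Combine the two boolean Series with & (both conditions must hold)
subset = df[in_years & low_pop]
print(subset)
```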


### DataFrame Assignment
The similarity between *DataFrames* and dictionaries continues with value assignment.

We can create a new column, filled with a single value, with a simple assignment operation:
```python
df[new_col] = val
```

In [None]:
df = get_data()
df['new_col'] = 0
df

We can also do this with computed columns:

In [None]:
# Assuming pop was in 100K format
df['actual_pop'] = df['pop'] * 100000
df

We can do this for any generated values, as long as they share the same **shape**/size

In [None]:
print("\n**************Creating Columns:")
df['sq_miles'] = np.random.randint(50, 100, size=df.shape[0]) # We can create a column by indexing at a new col_name
df

**But if the # of records doesn't match, we will get an error!**

In [None]:
df['sq_miles'] = np.random.randint(50, 100, size=df.shape[0]+10) # 10 values too many, so this raises a ValueError
df
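To see the error without stopping a notebook run, the assignment can be wrapped in `try`/`except`. This is a small sketch on a made-up two-row DataFrame, not the lecture data.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row DataFrame for illustration
df = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

try:
    # 5 values for a 2-row DataFrame: the lengths do not match
    df['sq_miles'] = np.arange(5)
except ValueError as err:
    print("Assignment failed:", err)

# The failed assignment did not add the column
print('sq_miles' in df.columns)
```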

### DataFrame Deletions
To remove data from a DataFrame we need to either **drop** or **delete** it.


*Note: a lot of pandas operations have the option to do the operation ***inplace***; otherwise the operation returns a modified copy*

In [None]:
df = get_data()
df_copy = df.drop('Ohio') # Can also do df.drop('Ohio', inplace=True), which will update df itself
df_copy

In [None]:
df = get_data()
df.drop('Ohio', inplace=True)
df

In [None]:
df = get_data()
del df['year'] # Like a dictionary, we can delete a 'key' with the del command
df
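`drop` can remove columns as well as rows. A minimal sketch using its `columns` parameter, with the lecture data rebuilt inline; unlike `del`, this leaves the original DataFrame untouched.

```python
import pandas as pd

df = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
     'year': [2000, 2001, 2002, 2001, 2002],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9]},
    index=['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
)

# columns= targets column labels; drop returns a new DataFrame
no_year = df.drop(columns=['year'])

print(no_year.columns.tolist())
print(df.columns.tolist())  # the original still has 'year'
```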

### Deletion Through Subsetting
We can also use subsetting/indexing to "delete" rows/columns

In [None]:
df = get_data()
df = df[['state', 'pop']] # Just select what you want
df

## DataFrame Arithmetic
As with Series, arithmetic aligns on labels: only row and column labels present in both DataFrames produce a value.

In [None]:
df1 = DataFrame(np.random.randint(-10, 100, size=12).reshape((4, 3)), columns=['a', 'b', 'c'])
df2 = DataFrame(np.random.randint(-10, 100, size=12).reshape((4, 3)), columns=['a', 'b', 'd'])

print(df1 + df2) # Just like Series arithmetic, labels missing from either frame end up NaN
# The other arithmetic operators (-, *, /) align on labels the same way

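Because the cell above uses random numbers, here is a deterministic sketch of the same alignment with small made-up frames. It also shows one way to avoid the all-`NaN` columns: restrict both frames to the columns they share before adding.

```python
import pandas as pd

# Small hypothetical frames: 'c' exists only in df1, 'd' only in df2
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df2 = pd.DataFrame({'a': [10, 20], 'b': [30, 40], 'd': [50, 60]})

summed = df1 + df2
print(summed)  # columns 'c' and 'd' are all NaN

# Keep only the columns both frames share before adding
shared = df1.columns.intersection(df2.columns)
shared_sum = df1[shared] + df2[shared]
print(shared_sum)
```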

### DataFrame Functions - Elementwise
There are two ways that we can apply functions to a dataframe:
 - **Axis**: these operations act as aggregators and apply a function over columns or rows
 - **Element-wise**: these operations will update each element based on the function
 
### Element-wise:
These operations primarily fall into either *arithmetic* operations or *ufunc-like* functions.

In [None]:
# Element based function applications

df1 = DataFrame(np.random.randint(-100, 100, size=12).reshape((4, 3)), columns=['a', 'b', 'c'])
df2 = DataFrame(np.random.randint(-100, 100, size=12).reshape((4, 3)), columns=['a', 'b', 'd'])

print("\n**************Original:")
print(df1)

print("\n**************ABS:")
print(np.abs(df1)) # We can use any of the numpy ufuncs

print("\n**************Square:")
print(df1 ** 2) # Arithmetic operators are applied element by element

## In Class Work - Problem 2:
Calculate the mean population per year

In [None]:
df = get_data()

# Space for work
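One possible approach to Problem 2, sketched using only the boolean subsetting covered so far: subset the DataFrame once per year and take the mean of `pop` for each subset. The `means` dictionary is a name introduced here for illustration, and other solutions are equally valid.

```python
import pandas as pd

df = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
     'year': [2000, 2001, 2002, 2001, 2002],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9]},
    index=['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
)

means = {}
for year in sorted(df['year'].unique()):
    # Rows for this year only, then the mean of their 'pop' values
    means[int(year)] = df[df['year'] == year]['pop'].mean()

for year, mean_pop in means.items():
    print(year, mean_pop)
```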


### Axis (apply):
 - With DataFrames we can apply functions along either *axis* of the dataframe.
  - axis=0 -> the function receives each column (the default)
  - axis=1 -> the function receives each row

In [None]:
df1 = DataFrame(np.arange(12).reshape((4, 3)), columns=['a', 'b', 'c'])
df2 = DataFrame(np.random.randint(-100, 100, size=12).reshape((4, 3)), columns=['a', 'b', 'd'])

df1

### Applying the Mean:
**Axis = 0**

In [None]:
func = lambda x: np.mean(x) # Function to apply
df1.apply(func)

**Axis = 1**

In [None]:
df1.apply(func, axis=1)

### Accessing Row/Col Values
Since the **apply()** method passes each row/col to the function as a *Series*, individual values can be accessed by label inside the function. Below, with `axis=1`, we directly reference known col names.

In [None]:
func = lambda x: x['a'] * x['c']
print(df1.apply(func, axis=1))
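For a simple product like this one, the same result can be had without `apply` by multiplying the two columns directly. The sketch below compares both routes on the `df1` defined earlier, rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=['a', 'b', 'c'])

# Row by row: the function receives each row as a Series
by_row = df1.apply(lambda row: row['a'] * row['c'], axis=1)

# Column arithmetic: multiplies matching rows of the two columns
by_column = df1['a'] * df1['c']

print((by_row == by_column).all())
```

The `apply` form is still the more general tool when the per-row logic cannot be written as column arithmetic.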

## In class work - Problem 3:
Given $ \text{GDP Deflator} = \frac{\text{Nominal GDP}}{\text{Real GDP}} \times 100 $, calculate the real GDP per row using the pandas apply() method and append the results to the current dataframe.

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2002, 2001, 2002],
       'Nominal GDP': [395.1, 398.9, 414.2, 102.2, 105.4],
       'GDP Deflator': [82.59, 84.23, 85.65, 84.23, 85.65]}
df = DataFrame(data)

# Space for work
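One possible approach to Problem 3, as a sketch rather than the only answer: rearranging the formula gives Real GDP = Nominal GDP / GDP Deflator * 100, which can be computed per row with `apply(..., axis=1)`. The function name `real_gdp` and the new column name `Real GDP` are choices made here for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'Nominal GDP': [395.1, 398.9, 414.2, 102.2, 105.4],
        'GDP Deflator': [82.59, 84.23, 85.65, 84.23, 85.65]}
df = pd.DataFrame(data)

def real_gdp(row):
    """Rearranged deflator formula: real = nominal / deflator * 100."""
    return row['Nominal GDP'] / row['GDP Deflator'] * 100

# axis=1 passes each row to real_gdp; the result becomes a new column
df['Real GDP'] = df.apply(real_gdp, axis=1)
print(df)
```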
