# Pandas Reference Guide

## Overview
* Python library useful for data munging
* Stands for *Pan*el *Da*ta
* Dataframe: 2D labeled data structure with columns of potentially different types
    * As seen in spreadsheets, SQL tables, or a dict of Series objects
    * Very useful for merging and playing with data from multiple sources
* Series: 1D labeled data structure
    * Dataframes are made up of Series
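
The relationship between the two structures can be sketched as follows. The names `legs`, `can_fly`, and `animals` are illustrative and not part of the guide's later examples.

```python
import pandas as pd

# A Series is a 1D array of values with an index of labels
legs = pd.Series([2, 4, 8], index=['falcon', 'dog', 'spider'], name='num_legs')
can_fly = pd.Series([True, False, False], index=legs.index, name='can_fly')

# A DataFrame is a collection of Series that share one index
animals = pd.DataFrame({'num_legs': legs, 'can_fly': can_fly})

# Selecting a single column returns a Series again
col = animals['num_legs']
print(type(col).__name__)

# Each column keeps its own dtype
print(animals.dtypes)
```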

## Data In / Data Out
---
### Loading Data
* Most common forms of data ingestion
    * Excel / CSV
    * Local Data

#### Excel / CSV
```python
import pandas as pd

df = pd.read_csv('curr_work_dir/example.csv')
df2 = pd.read_excel('curr_work_dir/example.xls')
```

#### Local Data
```python
# Dictionary
animal_data = {
        'num_legs': [2, 4, 8, 0],
        'num_specimen_seen': [10, 2, 1, 8]
}

df = pd.DataFrame(animal_data, index=['falcon', 'dog', 'spider', 'fish'])

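# List of dictionaries (one dictionary per row)
records = [{'x': 1, 'y': 2, 'z': 100},
           {'x': 2, 'y': 4, 'z': 100},
           {'x': 3, 'y': 8, 'z': 100}]

# Stored under a separate name so that df still holds the animal data
df_records = pd.DataFrame(records)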
```

---
### Saving Data
* Pandas has the ability to save local dataframes to different types of files

```python
df.to_csv('curr_work_dir/example.csv')
df.to_excel('curr_work_dir/example.xlsx')  # pandas 2.x can no longer write legacy .xls files
```
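
One detail worth knowing when saving: `to_csv` writes the index as an extra first column unless told otherwise. The sketch below uses an in-memory string instead of a file so it runs anywhere; `df_demo` and the other variable names are illustrative.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# With no path, to_csv returns the CSV text instead of writing a file.
# By default the index is written as an unnamed first column.
with_index = df_demo.to_csv()

# index=False leaves the index out of the output
without_index = df_demo.to_csv(index=False)

# Reading the index-free text back recovers the original columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.equals(df_demo))
```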
---
### Viewing Data

```python
# Show the first two entries
df.head(n=2)

# Show the last two entries
df.tail(n=2)

# Show a specified column
df.num_legs
```

```python
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'favorite_color': ['blue', 'grey-11', 'burgundy', 'red', 'green', 'green', 'orange'],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

data_2 = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'favorite_color': ['blue', 'light green', 'burgundy', 'red', 'green', 'green', 'orange'],
    'height': [121.1, 67.1, 54.5, 87.9, 61.7, 81.1, 79.1]
}

df = pd.DataFrame(data)

# Show the 'city' column for the rows labeled 0 through 3
df.loc[0:3, 'city']
# 0    Mexico City
# 1        Toronto
# 2         Prague
# 3       Shanghai
# Name: city, dtype: object
```

## Pandas Column Operations
---
### Joining Dataframes
* When two dataframes share a column, you can use it as a key to merge them together
* Other column names that appear in both dataframes get the suffixes _x and _y to tell them apart
```python
df = pd.DataFrame(data)
df2 = pd.DataFrame(data_2)
# Merge on 'name' only; without on=, every shared column would be used as a key
df3 = pd.merge(df, df2, on='name')
```
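
The choice of key changes the result noticeably. As a minimal sketch with two small hypothetical frames (`left` and `right`), compare the default merge with a merge on an explicit key:

```python
import pandas as pd

left = pd.DataFrame({'name': ['Ann', 'Yi'],
                     'favorite_color': ['grey', 'red'],
                     'age': [28, 34]})
right = pd.DataFrame({'name': ['Ann', 'Yi'],
                      'favorite_color': ['light green', 'red'],
                      'height': [67.1, 87.9]})

# Default: every shared column ('name' AND 'favorite_color') is a key.
# Ann's colors differ between the frames, so her row does not match.
merged_default = pd.merge(left, right)

# Explicit key: only 'name' has to match, and the other shared
# column is kept twice with the suffixes _x (left) and _y (right).
merged_on_name = pd.merge(left, right, on='name')
print(list(merged_on_name.columns))
```

The default join type is an inner join, which is why the unmatched row disappears instead of being filled with nulls.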

### Adding Columns
```python
# Adding based on external elements
last_names = ['Smith', 'Lewis', 'Xi', 'Trelyian', 'Celona', 'Brown', 'Lopez']
df3['last_name'] = last_names

# Adding based on internal columns
df3['full_name'] = df3['name'] + ' ' + df3['last_name']

# Adding based on a function
def passing_score(score_col):
    # Returns a boolean Series: True where the score is above 65
    return score_col > 65

df3['passed?'] = passing_score(df3['py-score'])
```

### Removing Columns
* Operations on the dataframe are dependent on the axis chosen
    * axis=0: Rows
    * axis=1: Columns

```python
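drop_cols = ['py-score']
# drop returns a new dataframe; df3 itself is unchanged
df_no_score = df3.drop(drop_cols, axis=1)

# del removes the column in place; a copy is used here so df keeps 'py-score'
df_copy = df.copy()
del df_copy['py-score']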
```
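
To make the axis choice concrete, here is a small sketch that drops along each axis. The `scores` frame is a hypothetical stand-in for the guide's data.

```python
import pandas as pd

scores = pd.DataFrame({'name': ['Ann', 'Yi'], 'py-score': [79.0, 80.0]})

# axis=1 (or axis='columns') drops a column by name
no_score = scores.drop(['py-score'], axis=1)

# axis=0 (or axis='index') drops a row by its index label
no_first_row = scores.drop([0], axis=0)

# drop returns a new dataframe each time, so scores is unchanged
print(scores.shape)
```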

### Renaming Columns
* The inplace argument controls how the data is modified
    * inplace=True: Changes are made directly to the dataframe
        * Returns None
    * inplace=False (default): The original dataframe is left unchanged
        * Returns a modified copy that must be assigned to a variable

```python
# Renaming based on a dict mapping
name_mapping = {'height': 'height_cm', 'name': 'first_name'}
df_no_score.rename(columns=name_mapping, inplace=True)
```
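
The difference between the two modes can be sketched side by side. The `people` frame and `mapping` dict are illustrative names.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Ann'], 'height': [67.1]})
mapping = {'height': 'height_cm'}

# inplace=False (the default): people is untouched, a renamed copy is returned
renamed = people.rename(columns=mapping)
columns_before = list(people.columns)

# inplace=True: people itself is modified and the call returns None
result = people.rename(columns=mapping, inplace=True)
columns_after = list(people.columns)

print(columns_before, columns_after, result)
```

Assigning the result of an inplace=True call back to the variable is a common mistake, since it replaces the dataframe with None.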

### Filtering Down / Accessing Information
* loc selects rows (and optionally columns) by label
* iloc selects rows (and optionally columns) by integer position

```python
# Accessing column information
df_height = df3['height']
df_height = df3.height
first_height = df3.height[0]

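# Accessing row information (label 0 is the first row with the default index)
first_row = df3.loc[0]

# Accessing sliced row information for a given column
# (loc slices include both endpoints, so this returns rows 0 and 1)
some_cities = df3.loc[0:1, 'city']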

# Accessing cell information
first_row_name = df3.at[0, 'name']
```
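
The difference between loc and iloc is easiest to see when the index labels are not 0, 1, 2. A minimal sketch, assuming a hypothetical `cities` frame with labels 10, 20, 30:

```python
import pandas as pd

cities = pd.DataFrame({'city': ['Toronto', 'Prague', 'Cairo']},
                      index=[10, 20, 30])

# loc uses labels: the row labeled 10, the column named 'city'
by_label = cities.loc[10, 'city']

# iloc uses positions: first row, first column
by_position = cities.iloc[0, 0]

# loc slices include the end label; iloc slices exclude the end position
label_slice = cities.loc[10:20, 'city']
position_slice = cities.iloc[0:1, 0]

print(len(label_slice), len(position_slice))
```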

### Sorting a Dataframe
```python
# Sort the dataframe by name descending (returns a new, sorted dataframe)
df_sorted = df.sort_values(by='name', ascending=False)
```

### Typecasting columns
* A column's data type determines how much memory it uses

```python
df_conv = df.astype(dtype={'age': str})
```
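
The memory effect of a dtype can be checked directly with `memory_usage`. This is a small sketch on a hypothetical `ages` frame; it casts to a narrower integer type, which is only safe when every value fits in that type.

```python
import pandas as pd

ages = pd.DataFrame({'age': [41, 28, 33, 34]})

as_int64 = ages['age'].astype('int64')
as_int8 = ages['age'].astype('int8')   # valid range is -128 to 127

# index=False counts only the bytes used by the values themselves
bytes_int64 = as_int64.memory_usage(index=False)  # 4 values * 8 bytes
bytes_int8 = as_int8.memory_usage(index=False)    # 4 values * 1 byte

print(bytes_int64, bytes_int8)
```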

---
## Pandas Row Operations

### Appending Values
```python
# df.append was removed in pandas 2.0; pd.concat stacks the rows instead
df3 = pd.concat([df3, df4], ignore_index=True)  # assume df4 is of the same schema as df3
```
### Filtering Values
* A comparison on a Series returns a boolean Series (a mask) that can be used to select rows of the df
* Masks can be combined with logical operators to build more complex filters
* Operators:
    * NOT: ~
    * AND: &
    * OR: |
    * XOR: ^
    
```python
great_scores = df3[df3['py-score'] > 80]
# Wrap each comparison in parentheses: | binds more tightly than > and <
extreme_scores = df3[(df3['py-score'] > 90) | (df3['py-score'] < 65)]
```
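
Masks can also be stored in variables and combined afterwards, which often reads more clearly than one long expression. A minimal sketch on a hypothetical `scores` frame:

```python
import pandas as pd

scores = pd.DataFrame({'name': ['Ann', 'Yi', 'Amal', 'Nori'],
                       'py-score': [79.0, 95.0, 61.0, 84.0]})

# Each comparison produces a boolean Series
high = scores['py-score'] > 90
low = scores['py-score'] < 65

# OR: rows that are either very high or very low
extreme = scores[high | low]

# NOT: everything that is not extreme
middle = scores[~(high | low)]

# AND: the same middle band written as two combined conditions
solid = scores[(scores['py-score'] >= 65) & (scores['py-score'] <= 90)]

print(extreme['name'].tolist(), middle['name'].tolist())
```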

### Removing Values
```python
# Drop the first row
df3 = df3.drop(labels=[0])
```

### Filling in Null Values
* Missing values can be filled in several ways
* Options:
    * Specified value
    * Forward fill: carry the last valid value forward
    * Backward fill: pull the next valid value backward

```python
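# Fill with a specified value (returns a new dataframe)
df_zero_filled = df3.fillna(value=0)
# Forward fill; fillna(method='ffill') is deprecated in recent pandas versions
df_forward_filled = df3.ffill()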
```
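
The three fill options are easiest to compare on a short Series with a gap in the middle. The `readings` Series below is illustrative.

```python
import pandas as pd

nan = float('nan')
readings = pd.Series([1.0, nan, nan, 4.0])

# Specified value: every null becomes 0
constant = readings.fillna(0)

# Forward fill: the last valid value (1.0) is carried into the gap
forward = readings.ffill()

# Backward fill: the next valid value (4.0) is pulled back into the gap
backward = readings.bfill()

print(constant.tolist(), forward.tolist(), backward.tolist())
```

A leading null has nothing to carry forward and a trailing null has nothing to pull back, so those stay null under the respective method.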

### Deleting Null Values
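* Optional Arguments:
    * 'axis':
        * Chooses whether rows or columns are dropped
        * 'index' (default): drop rows
        * 'columns': drop columns
    * 'how':
        * 'all': drop only if every value is null
        * 'any' (default): drop if at least one value is null
    * 'thresh':
        * Keeps a row only if it has at least thresh non-null values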
        
```python
# Drop every row that contains at least one null value
df_clean = df.dropna()
```
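
The optional arguments can be compared on a small table with a different number of nulls in each row. This is a sketch with a hypothetical `table` frame; the comments list which row or column labels survive.

```python
import pandas as pd

nan = float('nan')
table = pd.DataFrame({'a': [1.0, nan, nan, nan],
                      'b': [2.0, 5.0, nan, nan],
                      'c': [3.0, 6.0, 9.0, nan]})

# how='any': drop a row if any value is null -> keeps row 0
any_null = table.dropna(how='any')

# how='all': drop a row only if every value is null -> keeps rows 0, 1, 2
all_null = table.dropna(how='all')

# thresh=2: keep rows with at least 2 non-null values -> keeps rows 0, 1
at_least_two = table.dropna(thresh=2)

# axis='columns': apply the same rule to columns -> keeps column 'c'
full_columns = table.dropna(axis='columns', thresh=3)

print(list(any_null.index), list(all_null.index),
      list(at_least_two.index), list(full_columns.columns))
```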

### Performing String Operations
* Pandas can perform many types of string operations on a column through the .str accessor

```python
# Adding a new uppercase column
df3['city_upper'] = df3['city'].str.upper()
```
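
A few other common string operations, sketched on a hypothetical `cities` Series. Each one is applied element by element and returns a new Series.

```python
import pandas as pd

cities = pd.Series(['Mexico City', 'Toronto', 'Osaka'])

upper = cities.str.upper()                   # uppercase every value
has_space = cities.str.contains(' ')         # boolean mask, usable as a filter
first_word = cities.str.split(' ').str[0]    # first element of each split list
lengths = cities.str.len()                   # number of characters

print(first_word.tolist(), lengths.tolist())
```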

### Performing Index-Based Substitutions
```python
df.loc[0:3, 'py-score'] = 0
```