# Pandas Library

In [1]:
# import pandas as pd
import pandas as pd

# import numpy as np
import numpy as np

# import glob
from glob import glob

# import create_engine from sqlalchemy
from sqlalchemy import create_engine

# import web scraping functions
from urllib.request import urlretrieve, urlopen, Request

## General Info and Commands
#### Basic Commands
- Checking efficiency
    - `%%time`
        - add this as the first line of a cell to check run time and compare efficiencies of syntaxes
        - worth it to optimize code in many cases
- These commands assume a dataframe name of `df`
    - `len(df)` returns the number of rows in the `df`
- Methods
    - Print the 'head' or first 5 rows
        - `print(df.head())` use the '.head()' method of a dataframe
        - works without the `print()` call
        - supply an optional `int` as an arg to display that number of rows
    - Print the 'tail' or last 5 rows
        - `print(df.tail())`
        - works without the `print()` call
        - accepts optional int arg just as .head() does
    - Access the 'keys' of a dataframe
        - `df.keys()`
    - Access general info including number of non-null values and datatypes
        - `df.info()`
- Attributes
    - Print the 'columns' of a dataframe
        - `df.columns`
    - Print the 'indexes' of a dataframe
        - `df.index`
- Combining indexing with other methods
    - useful when each key of `df` is associated with its own dataframe (a dataframe of dataframes)
    - `print(df['key1'].head())` will print the head of the dataframe associated with 'key1'

## Summary Stats
- `df.describe()`
    - provides the following summary stats for each column
        - count, mean, std, min, 25% (1st quartile), 50% (median), 75% (3rd quartile), max
    - null entries are ignored for all stats
        - counts only include non null values
    - can supply an index `df.col.describe()` to restrict to specific columns
        - can also supply a list of columns using bracket notation
            - `df[['col1', 'col2']].describe()`
        - provides different results for categories
            - unique (num distinct entries), top (most frequent entry), freq (occurrences of top)
- `df['col'].unique()`
    - returns an array of the unique values in the column
    - best used with categorical data
    - can count the number of these
- `df.col.nunique()`
    - returns the number of unique values
- `df.col.value_counts()`
    - returns each unique value along with the number of occurrences
- `df.quantile(q)`
    - where q is a fractional number
        - can supply a list of such numbers, output is provided in labeled rows
        - interquartile range (IQR)
            - `df.quantile([0.25, 0.75])`
    - provides the value at the specified quantile (0.25, 0.5, 0.75, etc.)
    - command above produces output for all numerical columns
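
The summary commands above can be sketched on a small made-up dataframe. The column names (`species`, `petal_length`) and the values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: one categorical-like column, one numeric column with a null
df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.3, 5.5, None, 6.0],
})

stats = df['petal_length'].describe()        # count ignores the null entry
n_species = df['species'].nunique()          # number of distinct values
counts = df['species'].value_counts()        # occurrences of each value
quartiles = df['petal_length'].quantile([0.25, 0.75])
iqr = quartiles[0.75] - quartiles[0.25]      # interquartile range

print(stats)
print(counts)
print(iqr)
```

Here `quartiles` is a series labeled by the requested quantiles, which is why the IQR can be read off by label.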
    
#### Cross Tab
- `pd.crosstab(df.col1, df.col2)`
    - 'col1' and 'col2' should be categorical data
    - will output a frequency table with the number of occurrences where they overlap
        - i.e. 'F' and 'White', 'M' and 'White'...
    - the values from the first col passed will be outputted as the index, with the second as column labels
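
A minimal sketch of a frequency table, assuming two made-up categorical columns named `sex` and `race`:

```python
import pandas as pd

# Hypothetical categorical data
df = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M', 'F'],
    'race': ['White', 'White', 'Black', 'White', 'White'],
})

# Values of the first column become the index, the second the column labels
table = pd.crosstab(df.sex, df.race)
print(table)
```

Combinations that never occur (here 'M' and 'Black') appear in the table with a count of 0.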

#### Group By
- Lets you run a function to get summary information
    - `df.groupby('col_name').count()`
        - can use other functions besides `count()`
            - `.size()`
                - returns a series with the number of rows in each group (like `count()`, but null values are included)
            - `.mean()`
            - `.std()`
            - `.sum()`
            - `.median()`
            - `.first()` or `.last()`
            - `.max()` or `.min()`
            - `.idxmax()` or `.idxmin()` returns the row id/index for the max/min value
                - specify arg of `axis='columns'` to look within a row and return the column header
            - `.nunique()` counts the number of unique values
            - etc.
        - **can do multiple agg functions at once using `.agg()`**
            - `df.groupby('col_name').agg(['max', 'sum'])`
            - can define and pass a custom function
                - this function must accept a series of values and return a single value
                - pass the function as an arg, without `()`
                - `df.groupby(['col1', 'col2'])[['value1', 'value2']].agg(custom_func)`
            - can use different aggregation functions by passing a dictionary to `.agg`
                - 'keys' of the dictionary are column names
                - 'values' of the dictionary are agg functions to use
                - `df.groupby('col_name')[['value1', 'value2']].agg({'value1': 'sum', 'value2': custom_func})`
    - will 'group' data by the values in `'col_name'` and run the summary function
        - useful when 'grouping' by categorical data
    - can slice a single value from your `.groupby` column
        - `df.groupby('col_name')['value'].sum()`
            - can supply a list within the slice brackets (need two sets of brackets)
    - can use a multi-index for the `groupby` by passing a list as the `'col_name'` index
        - `df.groupby(['col1', 'col2'])['value'].mean()`
    - can grab a single column
        - `df.groupby('col1').col2.mean()`
- Example to create a custom group by 'decade'
    - `df` contains a 'year' column, and I want to count a number of occurrences by decade
        - `df.groupby(df.year // 10 * 10).col1.size()`
        - `.size()` returns a series indexed by decade, holding the number of rows in each decade
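
The group-by patterns above can be sketched as follows. The dataframe, its column names, and the helper `value_range` are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'year': [1983, 1987, 1991, 1995, 1999, 2004],
    'team': ['a', 'b', 'a', 'a', 'b', 'b'],
    'points': [10, 20, 30, 40, 50, 60],
})

def value_range(series):
    """Custom aggregation: accepts a series and returns a single value."""
    return series.max() - series.min()

# Several aggregations at once, named as strings
summary = df.groupby('team')['points'].agg(['max', 'sum'])

# A different aggregation per column via a dictionary
per_col = df.groupby('team').agg({'points': 'sum', 'year': value_range})

# Custom grouping key: floor each year to the start of its decade
by_decade = df.groupby(df.year // 10 * 10).size()

print(summary)
print(per_col)
print(by_decade)
```

Passing the function object `value_range` (without parentheses) lets pandas call it once per group.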

## Creating and Converting DataFrames

#### Creating a Dataframe
- From a python dictionary as `dict`
    - `df = pd.DataFrame(dict)`
        - keys become column names, values remain values
        - row labels are auto generated, 0 indexed
        - a single (scalar) value is broadcast to fill the entire column; lists must all be the same length
            - `df = pd.DataFrame({'Names': list_of_names, 'Sex': 'M'})`
            - every row has a value of 'M' for 'Sex' column
- From python lists
    - generate a list of `column_labels`
    - generate a list of `values`, which is a list of lists
        - the inner lists hold all the values
        - the outer list holds variables for the inner lists
            - must be in the correct order for the labels
            - See example below for code
    - use `dict(list(zip(column_labels, values)))`
        - can break this up into two or more steps for readability
    - use `pd.DataFrame` on the dict created above
- From a CSV file
    - `df = pd.read_csv('path/to/file.csv')`
        - can set `index_col=0` if you don't want pandas to auto generate row numbers in an unnamed column
            - or name a column as a string `index_col='date'`
        - set `header=None` if there are no column headers
        - see 'Working with Flat Files' section below
        - can chain `.sort_index()` onto the `pd.read_csv().sort_index()` command
    - Formatting CSV for input
        - good practice to use unnamed col with row keys
        - can let pandas generate this, or set up CSV file with no col name and row keys in first col
- Adding a column to a dataframe
    - `df.loc[:, 'new_column'] = values`
        - can use calculated values, values from another column, or assign a single value for each row

#### Changing Index (row labels)
- Create a list as `list_of_indexes` with length equal to the number of rows
    - `df.index = list_of_indexes`
- Convert to time series index
    - `df.index = df[time_column]`
    - `df = df.sort_index()` to then sort by that index
    - Another way, if you have a 'Date' column as a string in your df
        - `df.Date = pd.to_datetime(df['Date'])`
        - `df.set_index('Date', inplace=True)`
            - can modify the index this way without having to reassign it to the df
    - **Parse the date and set as index during import**
        - set `index_col='date'` and `parse_dates=True` in `pd.read_csv()`, or just set `index_col` to your date col if not parsing
- Rename your index column
    - `df.index.name = 'index_name'`
    - Delete index name
        - `df.index.name = None`
- Rename the name for your columns index
    - Your list of columns is an index that you can name
        - `df.columns.name = 'name'`
- Reindexing
    - Use to create a new df from a subset of another df
        - `new_df = old_df.reindex(columns=column_list)`
            - where column_list is a list of columns you want in the new df
    - Assigns a new index to a dataframe
        - `df.reindex(list[, method=])`
            - reindexes the dataframe using the supplied list
        - tries to match based on the old dataframe
            - you can supply the current indexes but in a new order
            - pandas will reorder the entire df to match, keeping rows together
            - **you can even supply the index of another dataframe**
                - `df.reindex(df2.index)`
        - any missing values are filled with NaN by default
            - `method='ffill'`
                - forward fill, fills missing values with last non-null value
            - `method='bfill'`
                - backward fill, fills from next value backwards
- **Hierarchical Indexing** (multiple indexes)
    - **Set index and reset index**
        - `df = df.set_index([col1, col2][, drop=][, append=])` supplying multiple columns to use for the index
            - order of columns matters, put fewest values first
                - i.e. 'col1' can have several values in 'col2' for an index
                - like 'col1' is a date and 'col2' are times on that date
            - `drop=True` will remove the column from the data when it adds it as an index
                - default is True
            - `append=True` will add the column as an index, rather than replacing the index
        - `df = df.reset_index('col2')`
            - would remove 'col2' from the index and turn it back into a column
            - produces no weird effects like 'unstack' below
        - print the index names for this type of index
            - `print(df.index.names)` to print the list of names
        - `df = df.sort_index()` to then sort the index
    - **Selecting using loc** 
        - order is important for slicing
            - supply the index col names in the correct order
        - **Translations for selecting**
            - `index_value1` is a value from the 1st index column
            - `second_index_value1` is another value from the 1st index column
            - `index_value2` is a value from the 2nd index column
        - by supplying a tuple for your index
            - `df.loc[(index_value1, index_value2), :]`
                - all rows at that index
            - `df.loc[(index_value1, index_value2), 'column_name']`
                - returns a single value at the intersection
            - `df.loc[(['index_value1', 'second_index_value1'], 'index_value2'), 'column_name']`
                - can also use `:` instead of `'column_name'` to select all columns
                - returns rows with index 1 values in the supplied list and index 2 value
            - `df.loc[('index_value1', ['index_value2', 'second_index_value2']), 'column_name']`
                - returns all rows with first index of `index_value1` and in list from index 2
                - can limit to certain columns or select all at `column_name` location
        - slice by an outer index only
            - `df.loc['index_value1']`
                - returns all rows where the outer index has the supplied value
            - `df.loc['index_value1':'second_index_value1']`
                - can select a range of values from an index
        - slice by both indexes
            - must use the `slice()` function
                - `df.loc[(slice(None), slice('index_value2', 'second_index_value2')), :]`
                    - `slice(None)` will select all from the first index
                    - `slice('index_value2', 'second_index_value2')` will slice only those values, inclusive
                    - can also just pass a single value rather than the `slice()` function in the tuple
    - **Unstacking a multi-index**
        - 'U' from unstack means moving an index level 'up' to a column level
        - Removes an index level and adds it as a level of the columns
            - if you 'unstack' the 'sex' index, each column will now be divided into the diff 'sex' values
        - `df.unstack(level='index_name')`
            - can supply the name, or the index level (0 indexed) as an int
            - often chain `.fillna(value)` to specify how to handle NaN values
                - sometimes '' empty string can be useful for NaN values
        - see `reset_index()` above if 'unstack' is not producing desired effects
        - **Stacking and unstacking** works to move columns to indexes and vice versa
        - **Set index and reset index** works to move the data in the columns to indexes and vice versa
    - **Stacking**
        - most useful when you have hierarchical columns
            - i.e. each column is divided into 'M', and 'F'
            - turns this level into an inner index level
        - `df.stack(level='column_level_name')`
    - **Swapping Index Levels**
        - `df.swaplevel(index_int1, index_int2)`
            - supply the int for the index i.e. (0,1)
        - probably need to `df.sort_index()` after this
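
A minimal sketch of the multi-index workflow above (set the index, select with `.loc`, unstack, reset). The columns `date`, `sex`, and `visits` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: two index candidates and one value column
df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'sex': ['F', 'M', 'F', 'M'],
    'visits': [3, 5, 4, 6],
})

# Outer level first, then the inner level; sort so that slicing works
df = df.set_index(['date', 'sex']).sort_index()

single = df.loc[('2020-01-01', 'M'), 'visits']   # one value at the intersection
females = df.loc[(slice(None), 'F'), :]          # inner level 'F' across all dates
wide = df.unstack(level='sex')                   # 'sex' moves up into the columns
flat = df.reset_index('sex')                     # 'sex' becomes a regular column again

print(wide)
print(flat)
```

After `unstack`, the columns are themselves hierarchical, so a single cell is addressed with a tuple such as `('visits', 'M')`.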

#### Changing Column Labels
- `df.columns = list_of_labels`
    - the `list_of_labels` must include the correct number of labels
    
#### Convert Columns to Different Data Types
- Useful to convert 'string' data types into 'categories' if only a few unique values
- Check the number of unique values in a column
    - `df['col_name'].unique()` 
        - displays all of the unique values in that column as a list
- Convert the type to categorical if reasonable to do so
    - `df['col_name'] = df['col_name'].astype('category')`
        - this saves memory and increases processing speed for operations on the df

#### Convert Dataframe to Numpy Array
- `array = dataframe.values`
    - if the columns hold mixed types, the resulting array has dtype `object`
    
#### Sorting by a column
- `df.sort_values('column_name'[, ascending=])`
    - default is `ascending=True`

## Selecting Elements in a Dataframe
#### Select rows or columns, dataframe as `df`
- Selecting based on indexes works faster if your indexes are sorted
    - `df = df.sort_index()` prior to running (or can save as `df_sorted` if needed)
- Selecting values
    - use the `.values` attribute to return the values as a numpy array
    - can use this on a single column or selection
        - `df_series_obj = df['column_name']`
        - `np_array = df_series_obj.values`
        - or simply `np_array = df['column_name'].values`
    - select a single 'cell'
        - `df['col_name']['row_index']` or `df.col_name['row_index']`
- Select your index values
    - `idx_vals = df.index.values`
        - returns a numpy array of the index values for each row
- Select entire column(s) using col names
    - `df['column_name']` or `df.column_name`
        - returns a pandas series object, with extra info in it (not a dataframe)
    - `df[['column_name']]`
        - returns a dataframe object
        - can supply multiple column names in the supplied list
- Selecting row(s) with slicing
    - `df[start:end:stride]`
        - specify the index for *start* and *end*
        - works like slicing a list
        - returns the *start* index and stops at index before the *end* index
        - leave *start* empty to start at the beginning
        - leave *end* empty to slice until the end
        - *stride* is the frequency of elements to choose (blank=1)
- Selecting rows/cols with slicing
    - `df[col][row]`
        - `col`
            - must be a single column name, which returns a series
            - a `[start:end:stride]` slice in this position selects rows, not columns
        - `row`
            - can simply be a row name
            - can use `[start:end:stride]` syntax
- Selecting time series indexed rows with slicing
    - `df.loc['datetime']`
        - supply the datetime as a string
        - if date and time info, can supply a date, to select all rows/times from that date
            - can supply just the year, year-month, year-month-day
                - will select everything that applies
    - `df.loc['start':'end']`
        - selects everything in the range, including the 'end' datetime
        
#### **Selecting Rows and Columns**
- `reindex` to subset a df
    - `new_df = old_df.reindex(columns=column_list)`
        - where `column_list` is a list of columns to include
- loc
    - `df.loc['row_label', 'col_label']`
        - returns a single value
    - `df.loc['row_label']`
        - returns a pandas series object
        - the series is labeled by the column names
    - `df.loc[['row_label']]`
        - returns a pandas dataframe object
        - can include multiple row labels in the list
    - `df.loc[['row_label'], ['column_label']]`
        - can supply a second list containing column labels to select
    - `df.loc[:, ['column_label']]`
        - selects all rows and the columns you specify in the second list
    - `df.loc[rows, cols]`
        - can use `[start:stop:stride]` for rows/cols, supply names still
            - the `stop` column is included, as it is named
            - returns a range from start to stop inclusive
        - can use `:` for all rows/cols
        - can supply a single label to select that row/col
        - can supply a list
    - `df.loc[(index1, index2)]`
        - multi-index selections
            - string indexes should go in ''
- iloc
    - Syntax for `row`, `column`, and `stride`
        - `row` & `column`
            - supply an int to access a single row/column
                - an int for both will return a single value
            - supply a list within `[]` to access multiple rows/columns
            - using `:`
                - `[start:stop:stride]`, where `stride` is optional
                - `[:3]` start at the beginning and include 3 rows/columns (not inclusive of index 3)
                - `[3:]` start at index 3 and include the rest of the values
    - same as loc, except you supply integer positions rather than names
        - `df.iloc[row, column]`
        - supply the `row` or `column` value within `[]` as a list to return a dataframe
    - can join with boolean indexing
        - `df.iloc[:, [True, False, True, False]]`
            - returns the columns at indexes 0 and 2
            - list of booleans must match the number of columns
    - **Lambda Functions**
        - `df.iloc[lambda x: x.index % 2 == 0]`
            - selects only even rows
        - `df.iloc[:, lambda df: [0, 2]]`
            - selects all rows and columns at indexes 0 and 2
    - Examples:
        - `df.iloc[1, 2]`
            - select value in second row, third column
        - `df.iloc[[1]]`
            - select the 2nd row as a dataframe
        - `df.iloc[[0, 1, 2], [0, 2]]`
            - select the first 3 rows and the first/third column as dataframe
        - `df.iloc[:, [4, 5]]`
            - all rows in the 5th/6th columns
        - `df.iloc[1:3, 0:3]`
            - rows 1 and 2, columns 0, 1, and 2
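
The difference between label-based and position-based slicing can be sketched as below, assuming a small made-up dataframe with string row labels.

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]},
    index=['w', 'x', 'y', 'z'],
)

by_label = df.loc['x':'y', 'a':'b']    # label slices include the stop label
by_position = df.iloc[1:3, 0:2]        # position slices exclude the stop position
every_other = df.iloc[::2, :]          # the stride goes inside the slice
one_value = df.iloc[1, 2]              # second row, third column

print(by_label)
print(every_other)
```

Both `by_label` and `by_position` select rows 'x' and 'y' and columns 'a' and 'b', even though the stop values are written differently.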

#### Zeros and NaNs
- Select only cols with all nonzero values
    - `df.loc[:, df.all()]`
- Select only cols with any nonzero values
    - `df.loc[:, df.any()]`
- Select only cols with all null values
    - `df.loc[:, df.isnull().all()]`
- Select only cols with any null values
    - `df.loc[:, df.isnull().any()]`
- Select only cols with no null values
    - `df.loc[:, df.notnull().all()]`
- Select only cols with any non-null values
    - `df.loc[:, df.notnull().any()]`
- Counting nulls
    - `df.isnull()` will create a dataframe of 'True' 'False' values
    - `df.isnull().sum()` will add all of the nulls up (True=1)
        - compare results to `df.shape` to see how much is null
- Remove nulls
    - `df.dropna(how=)` works on rows by default
        - `how='any'` will drop any **rows** that have at least 1 null value
        - `how='all'` requires all values in a row to be null for dropping
        - `thresh=n`
            - keeps only the rows with at least `n` non-null values and drops the rest
        - `axis='columns'` will drop columns rather than rows
        - `subset=[]`
            - provide a list of columns; rows are dropped only if they have nulls in those columns
            - i.e. `subset=['a', 'b']` will drop rows if there are nulls in either col 'a' or col 'b'
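
A minimal sketch of the null-handling options, assuming a made-up dataframe in which each row has one more null than the previous one. Note that `thresh` counts non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 3 is entirely null
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': [1.0, 2.0, np.nan, np.nan],
    'c': [1.0, 2.0, 3.0, np.nan],
})

null_counts = df.isnull().sum()            # nulls per column
any_dropped = df.dropna(how='any')         # keeps only fully populated rows
all_dropped = df.dropna(how='all')         # drops only the all-null row
at_least_two = df.dropna(thresh=2)         # keeps rows with 2 or more non-null values
subset_dropped = df.dropna(subset=['b'])   # only column 'b' is checked

print(null_counts)
print(at_least_two)
```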
            
#### Dropping Columns
- You can drop entire columns
    - store the result in a new df
- `df_dropped = df.drop(drop_col_list, axis='columns')`
    - easy way to subset data if only wanting certain columns
    - add `inplace=True` to drop without having to reassign to variable

#### Boolean Indexing
- Combine conditionals with indexing
- See above for **combining with iloc**
- Return a series object full of booleans
    - `df['column_name'] < value`   # or any other comparison
        - can supply whatever conditional you choose
        - must use this 'bracket' syntax when col names have special chars like '/'
    - `df.column_name` also works when 'column_name' doesn't have special chars in it
- Can use the series object as the index to return a dataframe that only selects 'True' values
    - Use directly as the index
        - `df[df.column_name < value]`
    - Store in a variable first
        - `bool_index = df.column_name == value`  # returns a series object
        - `df_new = df[bool_index]` # returns a dataframe
    - Select only certain cols while filtering using others
        - `df.col1[df.col2 > x]`
- Logical operators
    - Use numpy logical and/logical or
        - `bool_index = np.logical_and(df.column_name > 8, df.column_name < 20)`  # returns a series object
        - `df_new = df[bool_index]`  # returns a dataframe
        - Use `np.logical_or` the same way
    - Use '&', '|' instead of np logicals
        - `df[(df.col1 == x) & (df.col2 > y)]` like `np.logical_and`
        - `df[(df.col1 == x) | (df.col2 > y)]` like `np.logical_or`
    - Use '!=' for not equal, and '~' to negate a boolean series
        - `df[df.col != x]` where 'not equal' to 'x'
        - `df[~(df.col > x)]` where the condition is not true
    
#### Assigning New Dataframe Using Boolean Indexing on Data
- `filter = df.col == value` or `df['col'] == value`
    - 'value' can be a string used to divide the dataframe (like 'SC' or 'setosa')
- `df_filtered = df.loc[filter, :]`
    - extracts the filtered data into a new dataframe
- All at once
    - `df_filtered = df[df.col == value]`
    
#### Assign New Values to a Dataframe Based on a Condition
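- `df.loc[df.col2 < x, 'col1'] = np.nan`
    - assigns 'NaN' to 'col1' when 'col2' is less than 'x'
    - the chained form `df.col1[df.col2 < x] = np.nan` can raise `SettingWithCopyWarning` and may not modify `df`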
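
A short sketch of combining conditions and assigning on a condition, assuming a made-up dataframe with columns `col1` and `col2`. The assignment goes through a single `.loc` call rather than chained indexing.

```python
import numpy as np
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'col1': [10.0, 20.0, 30.0, 40.0], 'col2': [1, 5, 9, 13]})

both = df[(df.col1 > 10) & (df.col2 < 13)]      # element-wise AND
either = df[(df.col1 == 10) | (df.col2 > 9)]    # element-wise OR
negated = df[~(df.col2 > 5)]                    # ~ inverts a boolean series

# Set col1 to NaN wherever col2 is below 5
df.loc[df.col2 < 5, 'col1'] = np.nan

print(both)
print(df)
```

The parentheses around each comparison are required because `&` and `|` bind more tightly than `<` and `==`.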

In [2]:
%%time
# create a dataframe from a dictionary
colors = ['Brown', 'Green', 'Black', 'Yellow']
spanish = ['Cafe', 'Verde', 'Negro', 'Amarillo']
length_colors = [5, 5, 5, 6]
length_spanish = [4, 5, 5, 8]
dict1 = {'Colors': colors, 'Spanish': spanish, 'Length_Colors': length_colors, 'Length_Spanish': length_spanish}
df = pd.DataFrame(dict1)
print(df)
print()

# alt syntax to stitch a dataframe from lists
# using same lists above
# this method is more scalable
labels = ['Colors', 'Spanish', 'Length_Colors', 'Length_Spanish']
cols = [colors, spanish, length_colors, length_spanish] # list of lists
zipped = list(zip(labels, cols))
dict2 = dict(zipped)
df2 = pd.DataFrame(dict2)
print(df2)
print()

# select all columns and rows that have spanish != 5
bool_index = df['Length_Spanish'] != 5
print(df[bool_index])
print()

# select both length cols where Length_Colors is 5 or more and Length_Spanish is 5 or more
bool_index = np.logical_and(df['Length_Colors'] >= 5, df['Length_Spanish'] >= 5)
df_subset = df[bool_index]
print(df_subset.loc[:, ['Length_Colors', 'Length_Spanish']])
print()

# select the last two rows and first two cols
print(df.iloc[2:,0:2])

   Colors   Spanish  Length_Colors  Length_Spanish
0   Brown      Cafe              5               4
1   Green     Verde              5               5
2   Black     Negro              5               5
3  Yellow  Amarillo              6               8

   Colors   Spanish  Length_Colors  Length_Spanish
0   Brown      Cafe              5               4
1   Green     Verde              5               5
2   Black     Negro              5               5
3  Yellow  Amarillo              6               8

   Colors   Spanish  Length_Colors  Length_Spanish
0   Brown      Cafe              5               4
3  Yellow  Amarillo              6               8

   Length_Colors  Length_Spanish
1              5               5
2              5               5
3              6               8

   Colors   Spanish
2   Black     Negro
3  Yellow  Amarillo
Wall time: 36.9 ms


## DTypes
- Memory usage
    - `df.memory_usage()` can also run on a specific column
        - `index=True` by default
            - displays index memory usage as well
        - `deep=True` specifically looks into 'object' dtypes
- There are a myriad of reasons to use the proper dtype
    - math, datetime operations, categories use less memory and run faster, bools are special too
- `df.dtypes`
    - will print your datatypes for each column
- `df.column_name.dtype` or `df['column name'].dtype`
    - will print the dtype for a single column
- dtypes
    - object
        - `dtype('O')` is object dtype, which usually holds strings or mixed types, so numbers may be stored as strings (not what you want)
    - float
        - can do math
    - int
        - can do math
    - datetime
        - use `df['col_name'] = pd.to_datetime(df['col_name'])`
    - bool
        - t/f
    - category
        - use whenever you can
        - use on 'object' dtypes when the data are strings with relatively few unique values
        - `df['col'] = pd.Categorical(df.col, ordered=True, categories=cat_list)`
            - you should first define a list of categories in ascending order in `cat_list`
            - `ordered=False` by default
                - ordering allows the use of comparison operators < & > 
- Changing dtypes
    - `df['column_name'] = df.column_name.astype('dtype')`
        - must use bracket notation on the left to overwrite a series
        - see 'category' section above for extra args
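
A minimal sketch of converting a string column to an ordered category and checking the memory effect. The column name `shirt` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical string column with only three distinct values
df = pd.DataFrame({'shirt': ['small', 'large', 'medium', 'small'] * 250})

before = df['shirt'].memory_usage(deep=True)

cat_list = ['small', 'medium', 'large']   # categories in ascending order
df['shirt'] = pd.Categorical(df['shirt'], categories=cat_list, ordered=True)

after = df['shirt'].memory_usage(deep=True)

# Comparison operators work because the category is ordered
bigger_than_small = df[df['shirt'] > 'small']

print(before, after)
print(df['shirt'].dtype)
```

The exact byte counts depend on the pandas version and platform, so only the comparison between `before` and `after` is meaningful here.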


## Merging DataFrames
#### Reading Multiple Files
- Reading multiple files using a loop or comprehension
    ```python
    filenames = ['file1.csv', 'file2.csv']

    dataframes = [pd.read_csv(f) for f in filenames]
    ```
- Using the glob module
    - `from glob import glob`
    - `filenames = glob('filename*.csv')`
        - will grab every file that has the above format with other char(s) in the wildcard '\*' spot
    - use the `filenames` iterable object in a list comprehension as above
#### Appending and Concatenating Series and Dataframes
#### Append
- `series1.append(series2)`
    - will append series2 data to the end of series1
    - deprecated in pandas 1.4 and removed in pandas 2.0; use `pd.concat([series1, series2])` instead
    - columns matching
        - if columns match and the index name matches
            - will stack data nicely
        - if columns don't match and/or index name is different
            - will stack, will not merge records in both df/series
            - will add NaN for the columns not in each respective df/series
    - also works with dataframes
    - keeps the index from the original df/series
        - can result in duplicate indexes, so may need to `.reset_index(drop=True)`
        - `drop=True` discards the old index with repeated entries
        
#### **Concat**
- `df_complete = pd.concat([df1, df2, df3...][, ignore_index=True][, axis=][,keys=][, join=])`
    - accepts a list of dataframes or series to add together
    - using `ignore_index=True` adds only values and increments the index int
        - be careful with this if index is more than just an integer
        - default is False, which keeps original index
    - `axis='rows'` by default and will stack the df/series
        - same as `axis=0`
        - will stack the df/series even if the indexes match, and can repeat index labels
            - can specify `keys=[]` arg to add an index level to distinguish rows
    - `axis='columns'` will add columns to the df trying to match to the correct index
        - same as `axis=1`
        - will add NaN values for cols in the first df if they don't exist
        - will match on index value even if indexes have different names in the respective df's
        - also works with series, where each series becomes a column
    - `keys=`
        - adds an outer index level
        - can supply a value or a list of values to use as an index for each concatted df/series
        - length/order of the list should be the same as the number of items you are concatenating
        - can add a second level index, useful when using `axis=0`
        - will also add an outer index level for columns, useful when column names match
    - use a dictionary to specify `keys` argument
        - `dict_keys = {key1: df1, key2: df2}`
        - `pd.concat(dict_keys, axis=1)` will add `key1` and `key2` as outer indexes above column names
    - `join=` 
        - default is 'outer' in which all data is preserved, and NaN's are filled for missing areas
        - `join='inner'` will only join where data are matched, just like SQL inner join
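
The main `pd.concat` options can be sketched on two small made-up dataframes whose indexes overlap only on the label 'y'.

```python
import pandas as pd

# Hypothetical data: the indexes share only the label 'y'
df1 = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'a': [3, 4], 'b': [5, 6]}, index=['y', 'z'])

stacked = pd.concat([df1, df2])                             # repeats the 'y' label
renumbered = pd.concat([df1, df2], ignore_index=True)       # fresh integer index
labeled = pd.concat([df1, df2], keys=['first', 'second'])   # adds an outer index level
side_by_side = pd.concat([df1, df2], axis='columns')        # aligns on index, outer join
inner = pd.concat([df1, df2], axis='columns', join='inner') # keeps shared labels only

print(stacked)
print(labeled)
print(side_by_side)
```

With `keys`, a row is addressed by a tuple such as `('second', 'z')`, which removes the ambiguity of the repeated 'y' label.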
        
#### **Merge**
- Use merge when you need to match your dataframes on a column other than the index
    - or if you need more advanced control over the merging of dataframes
- `pd.merge(left=df1, right=df2[, on=][, how=][, suffixes=])`
    - joins the columns of both df's side by side by matching rows (there is no `axis` arg)
    - default is `how='inner'`, keeping only rows whose keys match in both df's
    - `on='column_name'` **or use** `left_on='', right_on=''` **when column_names are different**
        - specifies the column(s) to match
            - supply a list to match on multiple columns
            - can supply lists of equal length to `left_on` and `right_on` when needed
        - default action is to try to match all common columns
            - if the columns are the same in all df's, then will try to match all columns
        - shared columns that are not matched on will have names modified by appending '\_x' or '\_y'
    - `suffixes=[]`
        - supply a pair of suffixes in order, one for the left df and one for the right
        - will use the suffixes supplied rather than '\_x' or '\_y' for shared column names
    - `how=''`
        - `how='inner'` by default, doing an inner join
        - `how='left'` works just like a 'LEFT JOIN' with SQL
            - rows are matched when they match, when they don't NaN's are filled
        - `how='right'` works just like a 'RIGHT JOIN' with SQL
        - `how='outer'` keeps all rows from both df's, filling NaN's when not matched
- `pd.merge_ordered()`
    - works similarly to `pd.merge()` but will order the results
    - **default `how=` for this type is 'outer'!**
    - **Filling NaNs**
        - `fill_method=` can be specified
            - 'ffill' and other types of fills accepted to deal with NaN's
- `pd.merge_asof()`
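    - sort of like `merge_ordered()` but...
        - does a left join on the nearest key: by default each left row is matched to the last right row whose 'on' value is less than or equal to its own
        - both df's must be sorted by the 'on' column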
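
A minimal sketch of `pd.merge` and `pd.merge_asof`. The dataframes (`left`, `right`, `trades`, `quotes`) and their columns are made up for illustration.

```python
import pandas as pd

# Hypothetical data sharing a 'key' column and a 'value' column
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'value': [20, 30, 40]})

inner = pd.merge(left, right, on='key')                  # how='inner' by default
left_join = pd.merge(left, right, on='key', how='left')  # keeps every left row
outer = pd.merge(left, right, on='key', how='outer',
                 suffixes=('_left', '_right'))           # custom suffixes

# merge_asof: both frames sorted by 'time'; matches the latest earlier-or-equal row
trades = pd.DataFrame({'time': [2, 5, 9], 'qty': [100, 200, 300]})
quotes = pd.DataFrame({'time': [1, 4, 8, 10], 'price': [10.0, 11.0, 12.0, 13.0]})
asof = pd.merge_asof(trades, quotes, on='time')

print(inner)
print(outer)
print(asof)
```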

#### Pandas `.join()`
- `.join()` works similarly to `.merge()`
    - `df1.join(df2[, how=])`
        - default `how=` is 'left' and works like `pd.merge(how='left')`, but matches on the index by default
        - other `how=` types work just like `.merge()`

## Working with String Data
- Can chain `.str` methods to work with string data
- `df['str_col'].str.upper()`
    - returns a series (does not modify the data in place) with all caps
    - `df.index = df.index.str.upper()` will convert index strings to all caps
- `df['str_col'].str.contains(substring)`
    - returns a series of True/False values based on whether each entry contains the supplied substring
    - chain `.sum()` to the end of this, to get a count of the number of occurrences
        - since True = 1
- `df['str_col'] = df['str_col'].str.strip()`
    - strip whitespace from strings
    - works on column names, just use `df.columns`
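
The `.str` methods above can be sketched on a made-up column whose label and values carry stray whitespace.

```python
import pandas as pd

# Hypothetical data with padded column label and values
df = pd.DataFrame({' Name ': ['  alice', 'Bob  ', 'alicia']})

df.columns = df.columns.str.strip()             # clean the column labels
df['Name'] = df['Name'].str.strip()             # clean the values
upper = df['Name'].str.upper()                  # new series; df itself is unchanged
n_ali = df['Name'].str.contains('ali').sum()    # True counts as 1

print(upper)
print(n_ali)
```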

## Working with Category DType
- `df.col_name.value_counts()` returns each unique value and the number of occurrences in the column
    - `normalize=True` will return the proportion of each value as a decimal rather than the count
    

## Working with Numbers
- May need to clean some data
    - `df['column'] = pd.to_numeric(df['column'], errors='coerce')`
        - `errors='coerce'` will force the conversion adding NaN for non-numerics
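
A small sketch of the cleanup step above, using a made-up column that contains one non-numeric entry:

```python
import pandas as pd

df = pd.DataFrame({'column': ['1', '2.5', 'n/a', '4']})

# Non-numeric strings become NaN instead of raising an error
df['column'] = pd.to_numeric(df['column'], errors='coerce')
```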
        
#### Broadcasting Mathematical Operations
- Example: a dataframe (`ser_w2`) with two cols of numeric data, and a series (`ser_w1`) with one col of numeric data
    - `ser_w2.divide(ser_w1, axis='rows')`
        - axis='rows', because we want to divide both entries in row 1 by the number in row 1 of the series
- Some functions
    - `.add()`
    - `.divide()`
    - `.pct_change()`
        - calculates the pct change as a decimal from the previous entry
    - options
        - `fill_value=` default is None, which results in NaN whenever either value is missing; supply a number to substitute for a value missing on one side
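
A sketch of these operations on made-up data. The names `ser_w2` and `ser_w1` follow the example above (here `ser_w2` is a two-column dataframe); all values are assumptions. The code spells the axis as `'index'`, which is the documented name for row-wise alignment.

```python
import pandas as pd

ser_w2 = pd.DataFrame({'a': [10.0, 20.0], 'b': [30.0, 40.0]}, index=['r1', 'r2'])
ser_w1 = pd.Series([10.0, 20.0], index=['r1', 'r2'])

# Divide every entry in a row by that row's value in ser_w1
ratio = ser_w2.divide(ser_w1, axis='index')

# Fractional change from the previous entry (first entry has no predecessor)
growth = pd.Series([100.0, 110.0, 121.0]).pct_change()

# fill_value stands in for a label that is missing on one side only
s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0], index=['a'])
plain = s1.add(s2)                 # 'b' is NaN
filled = s1.add(s2, fill_value=0)  # 'b' is 2.0
```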

#### Simple arithmetic
- can just add, subtract etc. df or series
    - will match on indexes, but return NaN for index labels that aren't found in every df/series
- recall `//` is floor division: it returns the quotient rounded down, discarding any remainder
    - i.e. `1983 // 10 = 198`
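
A small illustration of index alignment and floor division, using made-up series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])

# Labels are matched before adding; 'a' has no partner, so it becomes NaN
total = s1 + s2

# Floor division on plain numbers and on a series of years
quotient = 1983 // 10
negative = -7 // 2           # rounds down, toward negative infinity
decades = pd.Series([1983, 1995]) // 10 * 10
```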

## Working with Datetimes
- `pd.to_datetime(list, format='')`
    - convert a list of values to datetimes
    - can supply a format string (using standard format chars like %Y-%m ...)
    - may need to convert the data to string format first and format properly
        - `df['date'] = df['date'].astype(str)`
        - `df['time'] = df['time'].apply(lambda x: '{:0>4}'.format(x))`
            - pad leading zeros on time if necessary
        - `dt_string = df['date'] + df['time']`
        - `date_times = pd.to_datetime(dt_string, format='%Y/%m/%d%H%M')`
        - `df_clean = df.set_index(date_times)`
- `df['date_col'].dt.hour`
    - can use datetime attributes to access pieces of a datetime from each value
        - this example will return a series containing only the hour for each datetime
- Combine dates and times into one column when in separate columns
    - concat them as strings then assign to df column
        - `combined = df.date_col.str.cat(df.time_col, sep=' ')`
        - `df['dt'] = pd.to_datetime(combined)`
        - `df.set_index('dt', inplace=True)` set the datetime to the index if desired
- Convert Timezones
    - make datetimes 'aware' by setting a local timezone
        - `aware_dates = df['date_col'].dt.tz_localize(timezone_string)`
            - where `timezone_string` is in the proper format ('US/Central')
    - convert datetimes
        - `eastern_dates = aware_dates.dt.tz_convert('US/Eastern')`
    - can chain the entire thing together (must repeat the `.dt.` part)
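
A sketch of the date-building and timezone steps above on a tiny made-up frame. The column names and the `%Y%m%d%H%M` layout are assumptions chosen to fit this sample data.

```python
import pandas as pd

df = pd.DataFrame({'date': [20200101, 20200102],
                   'time': [5, 1430],
                   'value': [1.0, 2.0]})

# Convert to strings and pad the time to four digits ('5' -> '0005')
df['date'] = df['date'].astype(str)
df['time'] = df['time'].apply(lambda x: '{:0>4}'.format(x))

# Concatenate and parse with an explicit format string
dt_string = df['date'] + df['time']
date_times = pd.to_datetime(dt_string, format='%Y%m%d%H%M')
df_clean = df.set_index(date_times)

# Pull out one component of each datetime
hours = date_times.dt.hour

# Make the datetimes timezone-aware, then convert
central = date_times.dt.tz_localize('US/Central')
eastern = central.dt.tz_convert('US/Eastern')

# Same thing chained; note the repeated .dt accessor
chained = date_times.dt.tz_localize('US/Central').dt.tz_convert('US/Eastern')
```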

## Data Transformations (Vectorized Computations)
- The methods below return a dataframe without modifying the original
    - the results must be stored, saved, or used
    - can store in the dataframe being manipulated by assigning to a new column
- Pandas Built-ins
    - Simply do arithmetic on cols
        - `df['sum_of_1_2'] = df.col1 + df.col2`
    - Apply a custom function to a dataframe (does not work on the index, see map below)
        - `df.apply(function)`
            - you must first define your function or supply a lambda function
            - supply function name without '()' or args
            - function must accept one arg: for `df.apply` that arg is each column as a series (each row with `axis=1`); for `df['col'].apply` it is each entry in the column
    - Apply a custom function to the index of a dataframe
        - `df.index.map(function)`
            - does not modify the index in place
                - must `df.index = df.index.map(function)`
            - function works like with `.apply`
                - no '()' or args passed
                - must accept one arg (each entry of the index)
                - lambda functions ok
    - 'Map' dictionary values to keys found in the df (**map method**)
        - `df['new_col'] = df.col.map(dict)`
            - will search `df.col` for the keys in the dictionary
            - will add the dict 'value' to `df.new_col` for each row corresponding to its 'key' in `df.col`
    - Divide by a number and round down
        - `df.col.floordiv(num)`
            - divides all values by the supplied 'num' and rounds down
            - useful for questions like 'how many dozen' (12 for num)
- Numpy ufuncs
    - Divide by a number and round down
        - `np.floor_divide(df, num)`
            - works like pandas `.floordiv()` above
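
The transformations above, sketched on a small made-up frame (column names, index labels, and the lookup dict are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]},
                  index=['a', 'b', 'c'])

# Plain arithmetic on columns
df['sum_of_1_2'] = df.col1 + df.col2

# DataFrame.apply hands the function one column (a series) at a time
col_ranges = df[['col1', 'col2']].apply(lambda col: col.max() - col.min())

# Series.apply hands the function one entry at a time
doubled = df['col1'].apply(lambda x: x * 2)

# Index.map returns a new index, so assign it back
df.index = df.index.map(str.upper)

# Series.map with a dict: keys not found in the dict give NaN
df['label'] = df['col1'].map({1: 'one', 2: 'two'})

# Floor division, pandas and numpy flavors
dozens = df['col2'].floordiv(12)
dozens_np = np.floor_divide(df['col2'], 12)
```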

#### Transforming Using groupby
- Can apply a transform function to a grouped set
    - `df.groupby('col_name')['value'].transform(func)`
        - returns a new series with the same index as the original (it does not alter the data in place)
            - the function should accept a series (one group's values) and return a series of the same length
                - aka, it is called once per group and the results are put back in the original row order
                - example `zscore`, which is the # of std away from the mean for each value
                - will transform each value in the 'value' column, grouped by 'col'
                
```python
def zscore(series):
    return (series - series.mean()) / series.std()
df.groupby('col')['value'].transform(zscore)
```
- can use `.apply(func)` when using more complex functions rather than `.transform()`
    - example to transform one column and return the whole df
```python
def zscore_details(group):
    # z-score col1 within the group; pass col2 and col3 through unchanged
    df = pd.DataFrame(
        {'zcol1': zscore(group['col1']),
         'col2': group['col2'],
         'col3': group['col3']})
    return df

df.groupby('col4').apply(zscore_details)
```
- `col2` and `col3` are returned as is, but col1 is transformed  
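
A runnable version of the `.transform()` pattern with made-up data, storing the result in a new column. The column names `col`, `value`, and `value_z` are assumptions.

```python
import pandas as pd

def zscore(series):
    # Number of standard deviations each value sits from its group's mean
    return (series - series.mean()) / series.std()

df = pd.DataFrame({'col': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'value': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})

# zscore is called once per group; the result lines up with the original rows
df['value_z'] = df.groupby('col')['value'].transform(zscore)
```

Both groups come out as -1, 0, 1 because each is standardized against its own mean and standard deviation.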
        
            

## Resampling
- Often used with datetimes for summary info
    - mean, count, sum, etc.
- `df.resample(freq)` usually chained with a stat method call
    - `freq` is a string to specify the frequency you want
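        - 'min' or 'T' for minute ('T' is deprecated in pandas 2.2+)
        - 'H' for hourly ('h' in pandas 2.2+)
        - 'D' for daily
        - 'B' for business day
        - 'W' for weekly
        - 'M' for monthly ('ME' in pandas 2.2+)
        - 'Q' for quarterly ('QE' in pandas 2.2+)
        - 'A' for annually ('YE' in pandas 2.2+)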
    - can add an integer to specify every 2 or 3 days etc.
        - `df.loc[:,'col'].resample('2W').mean()`
    - should chain summary functions to the resample call
        - `df['col'].resample('D').mean()`
- Chaining multiple methods is possible
    - `df.resample('M').sum().max()`
        - returns the maximum monthly sum
- Downsampling
    - reducing datetime rows to a slower frequency (daily to monthly)
    - no additional methods need to be chained beyond those desired
- Upsampling
    - increasing the frequency (weekly to daily)
    - need to tell pandas how to fill the extra data
    - `df.loc['yyyy-mm-dd':'yyyy-mm-dd', 'col'].resample('4H').ffill()`
        - will make the time appear every 4 hours, (even if all you had was daily)
        - will forward fill using values from previous times until a new value is encountered
            - good for running totals
        - fill options (chained as methods after `.resample()`)
            - `ffill` forward fill
            - `bfill` backward fill
            - `pad` older alias for forward fill
            - `first` only keeps the actual values, and fills with NaN values
    - interpolating data rather than filling
        - `df.resample('A').first().interpolate(method='linear')`
            - will use linear interpolation to turn a coarser time series into a yearly one
        - use `first` as the fill method for upsampling
        - use `.interpolate(method=type)` to specify how to fill
            - where `type` is a string specifying how to fill
- Rolling Mean
    - `data.rolling(window=).mean()`
        - calculates a smooth 'rolling' mean for your data or data slice
        - an integer `window` is the number of consecutive rows (observations) averaged at each point
        - `window=24` on hourly data will compute new values for each hourly point
            - based on a 24 hour window stretching out behind each point
        - `window=7` after a daily resample will do it daily
            - `data.resample('D').max().rolling(window=7).mean()`
                - calculates the rolling mean over 7 days of the daily maximum
                - the first 6 days are NaN, since a full window of 7 daily maxes is not available until day 7
                    - means are smoothed by using the 7 most recent daily maxes (the current day and the 6 before it)
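
A sketch of downsampling, upsampling, interpolation, and a rolling mean on made-up series. It sticks to the 'D' and 'W' frequency strings, and uses a window of 3 rather than 7 to keep the data small.

```python
import numpy as np
import pandas as pd

# 14 daily values, 0.0 through 13.0, starting on a Wednesday
idx = pd.date_range('2020-01-01', periods=14, freq='D')
daily = pd.Series(np.arange(14, dtype=float), index=idx)

# Downsample: daily -> weekly (weeks end on Sunday by default)
weekly_mean = daily.resample('W').mean()
max_weekly_sum = daily.resample('W').sum().max()

# Upsample: weekly -> daily, so the new rows need a fill rule
weekly = pd.Series([1.0, 2.0], index=pd.to_datetime(['2020-01-05', '2020-01-12']))
daily_ffill = weekly.resample('D').ffill()
daily_interp = weekly.resample('D').first().interpolate(method='linear')

# Rolling mean over 3 rows; the first 2 rows lack a full window
rolling3 = daily.rolling(window=3).mean()
```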

## Broadcasting
- Works like numpy broadcasting
- Syntax:
    - `df['column_name'] = value`
        - every row in 'column_name' has value of 'value' now
        - this 'column_name' can be a new or existing column

## Looping Through Dataframes

- Column Labels
    - `for i in df: statements`
- Rows  # need to use `.iterrows()` method
    - `for index, row in df.iterrows(): statements`
        - `index` refers to the row labels
        - `row` refers to a series object including col name and values

In [3]:
# using the df created above
# note that there is no col label for the row labels

for i in df:
    print(i)

Colors
Spanish
Length_Colors
Length_Spanish


In [4]:
# access columns
# .ljust(width) helps with alignment and spacing
for index, row in df.iterrows():
    print(('Row: ' + str(index)).ljust(10), ('Spanish: ' + str(row[1])).ljust(20), 'Spanish Length: ' + str(row[3]))

Row: 0     Spanish: Cafe        Spanish Length: 4
Row: 1     Spanish: Verde       Spanish Length: 5
Row: 2     Spanish: Negro       Spanish Length: 5
Row: 3     Spanish: Amarillo    Spanish Length: 8


In [5]:
# loop through a column's values
for index, row in df.iterrows():
    print(row[3] + 10)

14
15
15
18


#### Creating a New Column Based on a Calculation
- Using `.iterrows()` can do the same thing, but it's less efficient and best on small dataframes
- `df['new_column'] = df['column'].apply(function)`
    - supply a new column name and a function to use

In [6]:
# create a new column from the calculation above
df['color_lower'] = df['Colors'].apply(str.lower)
print(df['color_lower'])
print()

# apply a numeric function to a column to create a new column
def add_ten(x):
    return x + 10

df['length_plus_ten'] = df['Length_Colors'].apply(add_ten)
print(df['length_plus_ten'])

0     brown
1     green
2     black
3    yellow
Name: color_lower, dtype: object

0    15
1    15
2    15
3    16
Name: length_plus_ten, dtype: int64


## Reshaping Data Frames
- Melt, pivot, and pivot table are outlined in the Data Cleaning section of the Data Science file

## Working With File Types

#### Saving a DataFrame into a File
- `df.to_csv(filename[, sep])`
    - where `filename` is a string of your filename
    - `sep` can set a delimiter to other than comma (default)
        - `sep='\t'` for tab separated (use '.tsv' rather than '.csv' in file name)
- `df.to_excel(filename)`
    - name `filename` with '.xlsx' extension

#### Importing Flat Files
- Such as csv and txt files with rows and cols
- `dataframe = pd.read_csv(filename[, sep][, comment][, na_values][, nrows][, header][, names][, parse_dates])`
    - `index_col='col'`
        - can also supply a column name to use as the index
        - useful when parsing dates (see below)
    - `sep` is pandas version of delimiter with default `','`
    - `comment` takes the char that comments appear after (for python it's '#')
    - `na_values` 
        - takes a list of strings to identify/replace with NaN
            - blank spaces preceding values in the data can affect this
            - so check for them if experiencing issues
        - can accept a dictionary using column names as keys and lists of values to replace as the values
            - `na_values={'col1':['  -1', ' -1', '-1'], 'col2':['no_data', 'N/A']}`
    - `nrows` specifies an integer for the number of rows to retrieve
    - `header=None` if no header
        - `names=col_names` where 'col_names' is a list of header names for each column
        - the supplied list should have the same length as the number of columns
    - `parse_dates=[[index1, index2, index3]]`
        - can also set `parse_dates=True` and see what happens
            - combine with `index_col='date'` to index by these dates
        - intelligently parses the date entries from each supplied index and combines them into one datetime
        - use integers for the indexes (i.e. `parse_dates=[[4, 5]]`) to specify output of parsing
        - use a column name for the indexes (i.e. `parse_dates=[['year', 'month']]`)
            - use a single column name to keep the datetime together `parse_dates=['date']`
        - it may even parse the column name, need to test
    - view the header and first 5 lines of the dataframe with `.head()` method `dataframe.head()`
    
    - **Parse the date and set as index during import**
        - set `index_col='date'` and `parse_dates=True`
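
A sketch that combines several of the `read_csv` options above. It reads from an in-memory string so it runs without a file; the column names and sentinel values are made up.

```python
import io
import pandas as pd

csv_text = """# hypothetical sample file
date,temp,status
2020-01-01,10,ok
2020-01-02,-1,ok
2020-01-03,12,no_data
2020-01-04,13,ok
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=',',                 # the default, shown for completeness
    comment='#',             # skip everything after a '#'
    na_values={'temp': ['-1'], 'status': ['no_data']},  # per-column NaN markers
    nrows=3,                 # only the first 3 data rows
    index_col='date',        # use the date column as the index
    parse_dates=True,        # parse that index into datetimes
)
```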

#### Iterating Through Large Files
- Simple example using chunking to record each unique value and its number of occurrences
    - Initialize empty dictionary
        - `dict1 = {}`
    - Iterate over the file
          `for chunk in pd.read_csv(filevariable, chunksize=100):`
              `# iterate over a column in the file`
              `for entry in chunk['col_name']:`
                  `if entry in dict1.keys():`
                      `dict1[entry] += 1`
                  `else:`
                      `dict1[entry] = 1`
    - Convert to a dataframe
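        - `df = pd.DataFrame(list(dict1.items()), columns=['value', 'count'])`
            - `pd.DataFrame(dict1)` raises an error on a dict of single counts, so build the frame from the (key, count) pairs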
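
The counting loop above, written out as a runnable sketch. The data is an in-memory string standing in for a large file, and the names `counts` and `df_counts` are assumptions.

```python
import io
import pandas as pd

# Stand-in for a large file: one column, six rows
csv_text = "col_name\n" + "\n".join(['a', 'b', 'a', 'c', 'a', 'b'])

counts = {}
# Read two rows at a time and tally each value as it goes by
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for entry in chunk['col_name']:
        counts[entry] = counts.get(entry, 0) + 1

# One row per unique value
df_counts = pd.DataFrame(list(counts.items()), columns=['value', 'count'])
```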


#### Complex Example
- Use a reader object to read the files a specific number of lines at a time
    - `file_name_reader = pd.read_csv('filename', chunksize=num)`
        - common to store filename in a variable and use the variable
        - *num* is the number of lines to read, 1000 is a good number
- Initialize empty df
    - `data = pd.DataFrame()`
- Iterate over each chunk
    - `for grp in file_name_reader:` 
          `filtered_data = grp[grp['col_of_interest'] == condition]`
          # exclude all data not meeting condition
- Zip any columns you want
    - `data_zip = zip(filtered_data['col_name1'], filtered_data['col_name2'])`
- Convert zip object to a list
    - `data_list = list(data_zip)`
- Create new dataframe column (this example does a calculation on the two columns to get a %)
    - use list comprehension if needed to create your new column
        - `filtered_data['new_column'] = [int(tup[0] * tup[1] * 0.01) for tup in data_list]`
- Append this 'chunk' to the dataframe
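    - `data = pd.concat([data, filtered_data])`
        - `data.append(filtered_data)` only works in pandas versions before 2.0, where `.append()` was removed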
- Can nest this entire thing in a function to call by supplying relatively few parameters
    - DataCamp example similar to this in old `Python Library.docx` file
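
The steps above can be sketched end to end as follows. The column names (`region`, `population`, `pct_urban`, `urban_pop`) and the data are hypothetical, and the filtered chunks are collected in a list and concatenated once at the end.

```python
import io
import pandas as pd

csv_text = """region,population,pct_urban
A,1000,50
B,2000,25
A,3000,10
B,4000,75
A,5000,20
"""

chunks = []
for grp in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Keep only the rows meeting the condition; copy so a column can be added
    filtered = grp[grp['region'] == 'A'].copy()
    # Pair up the two columns and compute the new value per row
    pairs = list(zip(filtered['population'], filtered['pct_urban']))
    filtered['urban_pop'] = [int(pop * pct * 0.01) for pop, pct in pairs]
    chunks.append(filtered)

# Combine all filtered chunks into one dataframe
data = pd.concat(chunks)
```

Collecting chunks in a list and calling `pd.concat` once avoids copying the growing dataframe on every iteration.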

#### Excel Files
- `datafile = pd.ExcelFile('filename')`
- view different sheets in the file/dataframe
    - `print(datafile.sheet_names)`
    - use `.sheet_names` attribute of this object
- extract a sheet into a dataframe
    - `dataframe = datafile.parse(sheet[, skiprows][, names][, usecols])`
        - `sheet` supply sheet name as a str or index as an int (0 indexed)
        - the following args must be in list format
            - `[arg]` if only supplying one value
        - `skiprows` supply a list of rows to skip (0 indexed)
        - `names` supply a list of names for your imported columns
        - `usecols` supply a list of columns to import (0 indexed)
- Read Excel file and store each sheet as a dataframe with sheet names as the keys to each individual dataframe
    - `df = pd.read_excel('filename', sheet_name=None)`
        - can specify a sheet, or if `sheet_name=None` will load all sheets using sheet names as keys
        - can use a 'url' as the 'filename' to scrape data from the web

#### SAS and Stata Files
- SAS Files
    - `from sas7bdat import SAS7BDAT`
    - `with SAS7BDAT('filename.sas7bdat') as file:`
          `dfsas = file.to_data_frame()`
- Stata Files
    - `data = pd.read_stata('filename.dta')`

#### HDF5 Files
- HDF5 is becoming the industry standard for big data sets
- hierarchy of keys and values, where a value can itself hold further keys
- `import h5py`
  `filename = 'filename.hdf5'`
  `data = h5py.File(filename, 'r')`
- exploring data structure
    - `for key in data.keys():`
      `print(key)`
    - provides keys that can be accessed such as 'meta' for metadata
    - access its contents
        - `for key in data['meta'].keys():`
          `print(key)` returns another key in this example 'Description'
    - accessing values
        - `data['meta']['Description'][()]`
            - the older `.value` attribute was removed in h5py 3.0

#### Scraping Data from the Web
- Some functionality using the 'urllib' package
    - `from urllib.request import urlretrieve, urlopen, Request`
    - import not necessary when using some of the functions below
- Import data into a dataframe using a url
    - `url = 'http://....filename.csv'`
    - `df = pd.read_csv(url, sep=';')` using the appropriate separator (delimiter)
    - `df = pd.read_excel(url, sheet_name=None)`

## Working with Databases
- Need to import the appropriate package
    - `from sqlalchemy import create_engine`
- Creating an engine (sqlalchemy package)
    - `engine = create_engine('sqlite:///db_name.sqlite')`
        - above syntax `'db_type:///db_name.extension'`
- Running a query using Pandas
    - `df = pd.read_sql_query("SELECT * FROM table_name", engine)`
    - `engine` is the engine to connect to (see above)
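
A minimal sketch of running a query into a dataframe. To stay self-contained it swaps the sqlalchemy engine for an in-memory connection from the standard library `sqlite3` module, which `pd.read_sql_query` also accepts. The table contents are made up.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a real database file
conn = sqlite3.connect(':memory:')

# Create a small table to query (hypothetical data)
pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']}).to_sql(
    'table_name', conn, index=False)

# Run the query and load the result straight into a dataframe
df = pd.read_sql_query("SELECT * FROM table_name WHERE id > 1", conn)
conn.close()
```

With a sqlalchemy engine the last call looks the same, with `engine` passed in place of `conn`.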

## Loading and Concatenating Data into a Dataframe from Many Files
#### Concatenating Dataframes Using Pandas
- Useful when combining data sources
- `df_concat = pd.concat([df1, df2], axis=0,ignore_index=True)`
    - works well when your dataframes have the same columns in the same order
        - adds rows, keeping your columns
    - `axis` is optional
        - `axis=0` is default, and adds rows to your columns
        - `axis=1` will add new columns on the right of the dataframe, matching on row index
    - `ignore_index` is optional
        - default is `ignore_index=False` and keeps original index values (produces duplicate row indexes)
        - `ignore_index=True` will reindex the new dataframe
- Use `glob` to find files based on a pattern
    - need to `import glob`
    - useful when trying to process thousands of files for concatenation
    - uses **wildcards** to help matching
        - `*` matches zero or more of any char
        - `?` matches any single char in that position
        - `[ ]` matches chars specified within
            - `[0-9]` matches number 0-9
            - `[09]` matches 0 or 9
    - creates a list of file names that match your pattern
    - Example:
        - `csv_files = glob.glob('*.csv')` will store a list of all csv files
- Example: to combine these skills to create a large dataframe from many files
    - `list_data = []`
    - `for filename in csv_files:`
        - `data = pd.read_csv(filename)`
        - `list_data.append(data)`
            - this results in a list of dataframes, which can be loaded into `pd.concat`
    - `df = pd.concat(list_data)`
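
A self-contained sketch of the glob-and-concat pattern. It writes three small made-up CSV files to a temporary directory first so there is something to find; the file names and columns are assumptions.

```python
import os
import tempfile
from glob import glob

import pandas as pd

with tempfile.TemporaryDirectory() as tmpdir:
    # Create three hypothetical input files: part_0.csv, part_1.csv, part_2.csv
    for i in range(3):
        part = pd.DataFrame({'file_id': [i, i], 'value': [i * 10, i * 10 + 1]})
        part.to_csv(os.path.join(tmpdir, f'part_{i}.csv'), index=False)

    # [0-9] matches a single digit; sorted() makes the order predictable
    csv_files = sorted(glob(os.path.join(tmpdir, 'part_[0-9].csv')))

    # Read each file into its own dataframe
    list_data = [pd.read_csv(filename) for filename in csv_files]

# Stack the rows and renumber the index from 0
df = pd.concat(list_data, ignore_index=True)
```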