# Pandas Library

In [3]:
# import pandas as pd
import pandas as pd

# import numpy as np
import numpy as np

# import create_engine from sqlalchemy
from sqlalchemy import create_engine

# import web scraping functions
from urllib.request import urlretrieve, urlopen, Request

## General Info and Commands
#### Basic Commands
- These commands assume a dataframe name of `df`
- Print the 'head' or first 5 rows
    - `print(df.head())` use the '.head()' method of a dataframe
    - works without the `print()` call
- Print the 'tail' or last 5 rows
    - `print(df.tail())`
    - works without the `print()` call
- Access the 'keys' of a dataframe
    - `df.keys()`
- Combining indexing with other methods
    - useful when each key maps to its own dataframe (for example, a dict of dataframes)
    - `print(df['key1'].head())` prints the head of the dataframe associated with 'key1'
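
The commands above can be sketched as follows. This is a minimal illustration on made-up data; the column names and the `sheets` dict are assumptions, not part of the original notes.

```python
import pandas as pd

# hypothetical data for illustration only
df = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

first_rows = df.head()    # first 5 rows
last_rows = df.tail()     # last 5 rows
column_keys = df.keys()   # column labels (same as df.columns)

# a plain dict whose keys each map to a dataframe
sheets = {'key1': df, 'key2': df * 2}
print(sheets['key1'].head())
```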

## Creating and Converting DataFrames

#### Creating a Dataframe
- From a python dictionary as `dict`
    - `df = pd.DataFrame(dict)`
        - keys become column names, values remain values
        - row labels are auto generated, 0 indexed
- From a CSV file
    - `df = pd.read_csv('path/to/file.csv')`
        - can set `index_col=0` if you don't want pandas to auto generate row numbers in an unnamed column
    - Formatting CSV for input
        - good practice to use unnamed col with row keys
        - can let pandas generate this, or set up CSV file with no col name and row keys in first col
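
A short sketch of both creation routes, assuming made-up data. The CSV text is read from an in-memory `io.StringIO` buffer so the example runs without a file on disk; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# from a dictionary: keys become column names, rows are labelled 0, 1, ...
data = {'country': ['Brazil', 'India'], 'capital': ['Brasilia', 'New Delhi']}
df_from_dict = pd.DataFrame(data)

# from CSV text whose first column is unnamed and holds the row keys
csv_text = ",country,capital\nBR,Brazil,Brasilia\nIN,India,New Delhi\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_from_dict)
print(df_from_csv)
```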

#### Changing row labels
- create a list as `l` with length equal to the number of rows
    - `df.index = l`
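
For instance, a minimal sketch with made-up labels (the list is named `labels` here rather than `l`):

```python
import pandas as pd

df = pd.DataFrame({'capital': ['Brasilia', 'New Delhi', 'Tokyo']})

# one label per row; the list length must match the number of rows
labels = ['BR', 'IN', 'JP']
df.index = labels

print(df.loc['JP', 'capital'])
```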

#### Convert Dataframe to Numpy Array
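- `array = dataframe.values`
    - works best when all values share one type; mixed column types produce an array of dtype `object`
    - `dataframe.to_numpy()` is the newer equivalent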
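
A small sketch of the conversion on made-up data, showing what the resulting dtype looks like when the column types agree and when they differ:

```python
import pandas as pd

numeric_df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
mixed_df = pd.DataFrame({'x': [1, 2], 'name': ['a', 'b']})

numeric_array = numeric_df.values   # homogeneous integer array
mixed_array = mixed_df.values       # falls back to dtype object

print(numeric_array.dtype)
print(mixed_array.dtype)
```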

## Selecting Elements in a Dataframe
#### Select rows or columns, dataframe as `df`
- Select entire column(s) using col names
    - `df['column_name']`
        - returns a pandas series object, with extra info in it (not a dataframe)
    - `df[['column_name']]`
        - returns a dataframe object
        - can supply multiple column names in the supplied list
- Selecting row(s) with slicing
    - `df[start:end]`
        - specify the index for *start* and *end*
        - works like slicing a list
        - returns the *start* index and stops at index before the *end* index
        - leave *start* empty to start at the beginning
        - leave *end* empty to slice until the end
- loc
    - `df.loc['row_label']`
        - returns a pandas series object
    - `df.loc[['row_label']]`
        - returns a pandas dataframe object
        - can include multiple row labels in the list
    - `df.loc[['row_label'], ['column_label']]`
        - can supply a second list containing column labels to select
    - `df.loc[:, ['column_label']]`
        - selects all rows and the columns you specify in the second list
- iloc
    - same as loc, except you supply integer positions rather than labels
    - `df.iloc[[1]]`
        - select the 2nd row as a dataframe
    - `df.iloc[[0, 1, 2], [0, 2]]`
        - select the first 3 rows and the first/third column as dataframe
    - `df.iloc[:, [4, 5]]`
        - all rows in the 5th/6th columns
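
The selection styles above can be compared side by side. This is a sketch on a made-up dataframe; the row labels and column names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {'country': ['Brazil', 'India', 'Japan'],
     'capital': ['Brasilia', 'New Delhi', 'Tokyo'],
     'score': [1, 2, 3]},
    index=['BR', 'IN', 'JP'],
)

col_series = df['country']       # single brackets: Series
col_frame = df[['country']]      # double brackets: DataFrame
first_two = df[0:2]              # row slice: positions 0 and 1

row_series = df.loc['IN']                          # one row label: Series
label_subset = df.loc[['BR', 'JP'], ['capital']]   # rows and columns by label
pos_subset = df.iloc[[0, 2], [1]]                  # the same cells by position

print(label_subset)
```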

#### Boolean Indexing
- Combine conditionals with indexing
- Return a series object full of booleans
    - `df['column_name'] < value`   # the comparison returns one boolean per row
        - any comparison operator works (`<`, `>`, `==`, `!=`, ...)
- Can use the series object as the index to return a dataframe that only selects 'True' values
    - Use directly as the index
        - `df[df['column_name'] < value]`
    - Store in a variable first
        - `bool_index = df['column_name'] < value`  # returns a series object
        - `df_new = df[bool_index]` # returns a dataframe
- Logical operators
    - Use numpy logical and/logical or
    - `bool_index = np.logical_and(df['column_name'] > 8, df['column_name'] < 20)`  # returns a series object
    - `df_new = df[bool_index]`  # returns a dataframe
    - Use `np.logical_or` the same way

In [4]:
# create a dataframe from a dictionary
colors = ['Brown', 'Green', 'Black', 'Yellow']
spanish = ['Cafe', 'Verde', 'Negro', 'Amarillo']
length_colors = [5, 5, 5, 6]
length_spanish = [4, 5, 5, 8]
dict1 = {'Colors': colors, 'Spanish': spanish, 'Length_Colors': length_colors, 'Length_Spanish': length_spanish}
df = pd.DataFrame(dict1)
print(df)
print()

# select all columns and rows that have spanish != 5
bool_index = df['Length_Spanish'] != 5
print(df[bool_index])
print()

# select both length cols where Length_Colors is 5 or more and Length_Spanish is 5 or more
bool_index = np.logical_and(df['Length_Colors'] >= 5, df['Length_Spanish'] >= 5)
df_subset = df[bool_index]
print(df_subset.loc[:, ['Length_Colors', 'Length_Spanish']])

   Colors   Spanish  Length_Colors  Length_Spanish
0   Brown      Cafe              5               4
1   Green     Verde              5               5
2   Black     Negro              5               5
3  Yellow  Amarillo              6               8

   Colors   Spanish  Length_Colors  Length_Spanish
0   Brown      Cafe              5               4
3  Yellow  Amarillo              6               8

   Length_Colors  Length_Spanish
1              5               5
2              5               5
3              6               8


## Looping Through Dataframes

- Column Labels
    - `for i in df: statements`
- Rows  # need to use `.iterrows()` method
    - `for index, row in df.iterrows(): statements`
        - `index` refers to the row labels
        - `row` refers to a series object including col name and values

In [5]:
# using the df created above
# note that there is no col label for the row labels

for i in df:
    print(i)

Colors
Spanish
Length_Colors
Length_Spanish


In [15]:
# access columns
# .ljust(width) helps with alignment and spacing
for index, row in df.iterrows():
    print(('Row: ' + str(index)).ljust(10), ('Spanish: ' + str(row['Spanish'])).ljust(20), 'Spanish Length: ' + str(row['Length_Spanish']))

Row: 0     Spanish: Cafe        Spanish Length: 4
Row: 1     Spanish: Verde       Spanish Length: 5
Row: 2     Spanish: Negro       Spanish Length: 5
Row: 3     Spanish: Amarillo    Spanish Length: 8


In [7]:
# loop through a column's values
for index, row in df.iterrows():
    print(row['Length_Spanish'] + 10)

14
15
15
18


#### Creating a New Column Based on a Calculation
- `.iterrows()` can do the same thing, but it is slower and only practical on small dataframes
- `df['new_column'] = df['column'].apply(function)`
    - supply a new column name and a function to use

In [8]:
# apply a string method to a column to create a new column
df['color_lower'] = df['Colors'].apply(str.lower)
print(df['color_lower'])
print()

# apply a numeric function to a column to create a new column
def add_ten(x):
    return x + 10

df['length_plus_ten'] = df['Length_Colors'].apply(add_ten)
print(df['length_plus_ten'])

0     brown
1     green
2     black
3    yellow
Name: color_lower, dtype: object

0    15
1    15
2    15
3    16
Name: length_plus_ten, dtype: int64


## Working With File Types

#### Importing Flat Files
- Such as csv and txt files with rows and cols
- `dataframe = pd.read_csv(filename[, sep][, comment][, na_values][, nrows][, header])`
    - `sep` is pandas version of delimiter with default `','`
    - `comment` takes the char that comments appear after (for python it's '#')
    - `na_values` takes a list of strings to identify as NA or NaN
    - `nrows` specifies an integer for the number of rows to retrieve
    - `header=None` if no header
    - view the header and first 5 lines of the dataframe with `.head()` method `dataframe.head()`
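
A minimal sketch of these arguments, assuming made-up file contents read from an in-memory buffer (a real call would pass a filename instead of `io.StringIO`):

```python
import io
import pandas as pd

raw_text = (
    "# made-up measurements\n"
    "id;value\n"
    "1;10\n"
    "2;missing\n"
    "3;30\n"
    "4;40\n"
)

# semicolon-separated, '#' starts a comment, 'missing' is read as NaN,
# and only the first 3 data rows are loaded
df = pd.read_csv(io.StringIO(raw_text), sep=';', comment='#',
                 na_values=['missing'], nrows=3)
print(df.head())

# header=None: the first line is treated as data and columns are numbered
no_header = pd.read_csv(io.StringIO("1,2\n3,4\n"), header=None)
print(no_header)
```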

#### Iterating Through Large Files
- Simple example using chunking to record each unique value and its number of occurrences
    - Initialize empty dictionary
        - `dict1 = {}`
    - Iterate over the file
          `for chunk in pd.read_csv(filevariable, chunksize=100):`
              `# iterate over a column in the file`
              `for entry in chunk['col_name']:`
                  `if entry in dict1.keys():`
                      `dict1[entry] += 1`
                  `else:`
                      `dict1[entry] = 1`
    - Convert to a dataframe
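        - `df = pd.DataFrame(list(dict1.items()), columns=['value', 'count'])`
            - passing a dict of scalar counts straight to `pd.DataFrame` raises a `ValueError`, so build the rows from `dict1.items()` (the column names here are only examples)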
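
Put together, the counting loop might look like the sketch below. The data is made up and read from an in-memory buffer, the chunk size is deliberately tiny, and the output column names `value` and `count` are assumptions chosen for illustration.

```python
import io
import pandas as pd

# made-up single-column file
csv_text = "col_name\n" + "\n".join(['a', 'b', 'a', 'c', 'a', 'b']) + "\n"

counts = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # each chunk is a small dataframe; tally the values in one column
    for entry in chunk['col_name']:
        counts[entry] = counts.get(entry, 0) + 1

# one row per unique value
counts_df = pd.DataFrame(list(counts.items()), columns=['value', 'count'])
print(counts_df)
```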


#### Complex Example
- Use a reader object to read the files a specific number of lines at a time
    - `file_name_reader = pd.read_csv('filename', chunksize=num)`
        - common to store filename in a variable and use the variable
        - *num* is the number of lines to read, 1000 is a good number
- Initialize empty df
    - `data = pd.DataFrame()`
- Iterate over each chunk (the steps below run inside this loop)
    - `for grp in file_name_reader:` 
          `filtered_data = grp[grp['col_of_interest'] == condition]`
          `# exclude all data not meeting condition`
- Zip any columns you want
    - `data_zip = zip(filtered_data['col_name1'], filtered_data['col_name2'])`
- Convert zip object to a list
    - `data_list = list(data_zip)`
- Create new dataframe column (this example does a calculation on the two columns to get a %)
    - use list comprehension if needed to create your new column
        - `filtered_data['new_column'] = [int(tup[0] * tup[1] * 0.01) for tup in data_list]`
- Append this 'chunk' to the dataframe
    - `data = pd.concat([data, filtered_data])`
        - `DataFrame.append` was removed in pandas 2.0, so use `pd.concat` instead
- Can nest this entire thing in a function to call by supplying relatively few parameters
    - DataCamp example similar to this in old `Python Library.docx` file
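
One way to wrap these steps in a function is sketched below. The function name `filter_chunks`, its parameters, and the sample data are all hypothetical; the sketch also collects the chunks in a list and concatenates once at the end, which is a design choice rather than the only option.

```python
import io
import pandas as pd

def filter_chunks(source, chunksize, col, wanted, col1, col2, new_col):
    """Read `source` in chunks, keep rows where `col == wanted`,
    and add `new_col` computed from `col1` and `col2`."""
    pieces = []
    for grp in pd.read_csv(source, chunksize=chunksize):
        # .copy() so the new column is added to an independent dataframe
        filtered = grp[grp[col] == wanted].copy()
        pairs = list(zip(filtered[col1], filtered[col2]))
        filtered[new_col] = [int(a * b * 0.01) for a, b in pairs]
        pieces.append(filtered)
    return pd.concat(pieces)

# made-up data: 'pct' is a percentage of 'total'
csv_text = (
    "region,total,pct\n"
    "north,200,50\n"
    "south,100,10\n"
    "north,300,20\n"
    "north,1000,5\n"
)
result = filter_chunks(io.StringIO(csv_text), 2, 'region', 'north',
                       'total', 'pct', 'part')
print(result)
```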

#### Excel Files
- `datafile = pd.ExcelFile('filename')`
- view different sheets in the file/dataframe
    - `print(datafile.sheet_names)`
    - use `.sheet_names` attribute of this object
- extract a sheet into a dataframe
    - `dataframe = datafile.parse(sheet[, skiprows][, names][, usecols])`
        - `sheet` supply sheet name as a str or index as an int (0 indexed)
        - the following args must be in list format
            - `[arg]` if only supplying one value
        - `skiprows` supply a list of rows to skip (0 indexed)
        - `names` supply a list of names for your imported columns
        - `usecols` supply a list of columns to import (0 indexed)
- Read Excel file and store each sheet as a dataframe with sheet names as the keys to each individual dataframe
    - `df = pd.read_excel('filename', sheet_name=None)`
        - pass a sheet name or index to `sheet_name`, or use `sheet_name=None` to load every sheet into a dict keyed by sheet name
        - can use a 'url' as the 'filename' to scrape data from the web

#### SAS and Stata Files
- SAS Files
    - `from sas7bdat import SAS7BDAT`
    - `with SAS7BDAT('filename.sas7bdat') as file:`
          `dfsas = file.to_data_frame()`
- Stata Files
    - `data = pd.read_stata('filename.dta')`
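
To try `pd.read_stata` without a `.dta` file on hand, the sketch below first writes a small made-up dataframe with `DataFrame.to_stata` into a temporary directory and then reads it back. The file name and columns are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

original = pd.DataFrame({'score': [1, 2, 3], 'weight': [1.5, 2.5, 3.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.dta')
    # write_index=False keeps the row labels out of the file
    original.to_stata(path, write_index=False)
    data = pd.read_stata(path)

print(data)
```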

#### HDF5 Files
- HDF5 is a widely used format for storing large data sets
- hierarchy of keys and values, where a value can itself hold further keys
- `import h5py`
  `filename = 'filename.hdf5'`
  `data = h5py.File(filename, 'r')`
- exploring data structure
    - `for key in data.keys():`
      `print(key)`
    - provides keys that can be accessed such as 'meta' for metadata
    - access its contents
        - `for key in data['meta'].keys():`
          `print(key)` returns another key in this example 'Description'
    - accessing values
        - `data['meta']['Description'][()]`
            - older h5py versions used `.value`, which was removed in h5py 3.0

#### Scraping Data from the Web
- Some functionality using the 'urllib' package
    - `from urllib.request import urlretrieve, urlopen, Request`
    - this import is not needed for the pandas functions below, which accept a url directly
- Import data into a dataframe using a url
    - `url = 'http://....filename.csv'`
    - `df = pd.read_csv(url, sep=';')` using the appropriate separator (delimiter)
    - `df = pd.read_excel(url, sheet_name=None)`

## Working with Databases
- Need to import the appropriate package
    - `from sqlalchemy import create_engine`
- Creating an engine (sqlalchemy package)
    - `engine = create_engine('sqlite:///db_name.sqlite')`
        - above syntax `'db_type:///db_name.extension'`
- Running a query using Pandas
    - `df = pd.read_sql_query("SELECT * FROM table_name", engine)`
    - `engine` is the engine to connect to (see above)
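
A self-contained sketch of the query step. To avoid depending on a database file, it swaps the SQLAlchemy engine for an in-memory SQLite connection from the standard-library `sqlite3` module, which pandas also accepts for SQLite; the table contents are made up.

```python
import sqlite3
import pandas as pd

# in-memory SQLite database (stands in for the engine described above)
conn = sqlite3.connect(':memory:')

# create a small made-up table to query
pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']}).to_sql(
    'table_name', conn, index=False)

df = pd.read_sql_query("SELECT * FROM table_name WHERE id > 1", conn)
conn.close()

print(df)
```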