# **Pandas Help: Functions & Methods**

You will use the pandas documentation often. The full API reference is at [pandas.pydata.org/pandas-docs/stable/reference](https://pandas.pydata.org/pandas-docs/stable/reference).

Documentation on Pandas Functions can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html). 

Documentation on Pandas DataFrame Methods can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html). 

Documentation on Pandas Series Methods can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html). 

### Return valuable information about a pandas DataFrame.

>df.shape

>df.info()

>df.describe()
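
As a minimal sketch of the three calls above, the example below builds a small DataFrame and inspects it. The team names and numbers are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "team": ["Ants", "Bears", "Cats", "Dogs"],
    "wins": [10, 7, 7, 3],
    "losses": [2, 5, 5, 9],
})

n_rows, n_cols = df.shape   # .shape is an attribute holding a tuple, so no parentheses
df.info()                   # prints a summary and returns None
summary = df.describe()     # by default, statistics for the numeric columns only

print(n_rows, n_cols)
print(summary.loc["mean"])
```

Note that `df.info()` prints directly, so there is nothing useful to assign from it, while `df.describe()` returns a new DataFrame you can index like any other.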

### Create subsets of data from a pandas DataFrame.

>df[bool_series]

>.loc[row_indexer, col_indexer]

>.iloc[row_indexer, col_indexer]
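
The three subsetting forms above can be compared side by side. This is a sketch on hypothetical data; the string index labels `a` to `d` are an assumption chosen so that label-based and position-based selection are easy to tell apart.

```python
import pandas as pd

# Hypothetical data with string row labels, invented for illustration only.
df = pd.DataFrame(
    {
        "team": ["Ants", "Bears", "Cats", "Dogs"],
        "wins": [10, 7, 7, 3],
        "losses": [2, 5, 5, 9],
    },
    index=["a", "b", "c", "d"],
)

# Boolean mask: keep the rows where the condition is True.
bool_series = df["wins"] > 5
winners = df[bool_series]

# .loc slices by label and includes the stop label "c".
by_label = df.loc["a":"c", ["team", "wins"]]

# .iloc slices by position and excludes the stop position 2.
by_position = df.iloc[0:2, 0:2]

print(winners)
print(by_label)
print(by_position)
```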

### Create and manipulate columns in a pandas DataFrame.

>.drop(columns=)

>.rename(columns={'original_name': 'new_name'})

>.assign(new_column = some_expression_or_calculation)
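
The sketch below applies each of the three methods above to a hypothetical DataFrame. The column names `w`, `l`, `games`, and `club` are made up for the example. It also shows the alternative of editing a list of names and assigning it back to `df.columns`.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "team": ["Ants", "Bears", "Cats", "Dogs"],
    "wins": [10, 7, 7, 3],
    "losses": [2, 5, 5, 9],
})

# Each method returns a new DataFrame; df itself is left unchanged.
dropped = df.drop(columns=["losses"])
renamed = df.rename(columns={"wins": "w", "losses": "l"})
with_games = df.assign(games=df["wins"] + df["losses"])

# Alternative rename: edit the list of labels and assign it back.
relabeled = df.copy()
names = relabeled.columns.tolist()
names[0] = "club"
relabeled.columns = names

print(with_games)
print(relabeled.columns.tolist())
```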

### Sort the data in a pandas DataFrame.

>.sort_values(by=column(s))

>.sort_index(ascending=True)
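
A short sketch of both sorting methods on hypothetical data. Sorting by a list of columns with a matching list for `ascending` is one way to break ties; the choice of `team` as the tie-breaker here is only an assumption for the example.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "team": ["Cats", "Ants", "Dogs", "Bears"],
    "wins": [7, 10, 3, 7],
})

# Sort by one column, largest values first.
by_wins = df.sort_values(by="wins", ascending=False)

# Sort by wins (descending), then by team name (ascending) to break ties.
by_two = df.sort_values(by=["wins", "team"], ascending=[False, True])

# Sorting by the index puts the rows back in their original order.
restored = by_wins.sort_index(ascending=True)

print(by_two)
print(restored)
```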

### Chain methods to perform more complicated data manipulations.

>df.drop().rename()
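
Because each of these methods returns a new DataFrame, the calls can be strung together. The following is one possible chain on hypothetical data; wrapping it in parentheses so that each step sits on its own line is a style choice, not a requirement.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "team": ["Cats", "Ants", "Dogs", "Bears"],
    "wins": [7, 10, 3, 7],
    "losses": [5, 3, 6, 1],
})

result = (
    df.rename(columns={"wins": "w", "losses": "l"})
      .assign(games=lambda d: d["w"] + d["l"])   # the lambda receives the renamed frame
      .drop(columns=["l"])
      .sort_values(by="games", ascending=False)
)

print(result)
```

The lambda inside `.assign()` matters here: it refers to the intermediate DataFrame in the chain, where the column is already called `w`, rather than to the original `df`.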

- **.sort_values()** returns a sorted copy of a given DataFrame unless inplace=True.

- Signature of .sort_values() with its default arguments (by is required and has no default)

    - df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

- **.loc** selects rows and columns by label instead of by integer position. With this attribute, slicing IS inclusive: both the start label and the stop label are returned.
- General form of .loc
    - df.loc[row_indexer, column_indexer]


- **.iloc** accesses a group of rows and columns by their integer position. Slicing is NOT inclusive: the stop position is excluded, as in ordinary Python slicing.
- General form of .iloc
    - df.iloc[row_indexer, column_indexer]

- **.dtypes** returns the data type of each column.


- **.shape** returns a tuple of (number of rows, number of columns) for the DataFrame

- **.columns** returns the column labels as an Index object; you can also assign new labels to this attribute

- **.index** returns the labels for each row (usually an autogenerated number)

- **.info()** prints a summary of the DataFrame: total number of rows, column names, number of non-null values in each column, dtype of each column, and memory usage


- **.describe()** provides us with the descriptive/summary statistics for all columns with numeric dtypes; gives a quick summary of the numerical values in a dataframe.

- **.head()** for the first n (default 5) rows


- **.tail()** for the last n (default 5) rows


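- **.sample()** for a random sample of rows (one row by default)

- **.sample(n, random_state=int)** samples n rows
- **.sample(frac=proportion, random_state=int)** samples a proportion of the rows; frac must be passed by keyword, because the first positional argument is n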

- **.drop(columns=[])** drops one or more columns. The original DataFrame is not changed; a new DataFrame is returned instead. However, you can use the inplace argument to change the original DataFrame.
    - Or "drop" columns by selecting only the ones you want to keep, which can avoid errors (for example, .drop() raises a KeyError if the column is already gone).


- **.rename(columns={...})** takes a dictionary with the original name as the key and the new name as the value.

    - If you want your original DataFrame to reflect the new column names, either assign the result back to a variable or set inplace=True.
    - Another way to rename columns, especially if you want to change many at once:
        - Use the .columns attribute to grab the column labels. Adding the .tolist() method is optional, but it prints the current names as a plain list that is easy to copy and edit.
        - Then make any changes to the names in the list and assign the list back to df.columns.

- **.assign()** creates new columns and is useful for chaining methods together or to assign multiple columns

- **.nsmallest(n, columns, keep='first'/'all'/'last')** returns the n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering. The keep argument decides which rows win when there are ties.

- This method is equivalent to **df.sort_values(columns, ascending=True).head(n)**, but more performant.

- Parameters:
    - n: int
        - Number of items to retrieve.

    - columns: *list* or *str*
        - Column name or names to order by.

    - keep: {'first', 'last', 'all'}, default 'first'
        - Where there are duplicate values:

            - first : take the first occurrence.

            - last : take the last occurrence.

            - all : do not drop any duplicates, even if it means selecting more than n items.


- **.nlargest(n, columns, keep='first'/'all'/'last')** returns the n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering. The keep argument decides which rows win when there are ties.

- This method is equivalent to **df.sort_values(columns, ascending=False).head(n)**, but more performant.

- Parameters:
    - n: int
        - Number of rows to return.

    - columns: *label* or *list of labels*
        - Column label(s) to order by.

    - keep: {'first', 'last', 'all'}, default 'first'
        - Where there are duplicate values:

            - first : prioritize the first occurrence(s)

            - last : prioritize the last occurrence(s)

            - all : do not drop any duplicates, even if it means selecting more than n items.
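
The effect of `keep` is easiest to see when values tie. In this sketch the data is hypothetical: Bears and Cats both have 7 wins, so they compete for the second slot.

```python
import pandas as pd

# Hypothetical data with a tie at 7 wins, invented for illustration only.
df = pd.DataFrame({
    "team": ["Ants", "Bears", "Cats", "Dogs"],
    "wins": [10, 7, 7, 3],
})

top_first = df.nlargest(2, "wins")              # tie goes to the first occurrence
top_last = df.nlargest(2, "wins", keep="last")  # tie goes to the last occurrence
top_all = df.nlargest(2, "wins", keep="all")    # every tied row is kept
bottom = df.nsmallest(2, "wins")

# Roughly the same result by sorting and taking the head.
sorted_top = df.sort_values("wins", ascending=False).head(2)

print(top_first)
print(top_last)
print(top_all)
print(bottom)
```

With `keep="all"` the result has three rows even though `n` is 2, because both tied teams are retained.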

### What does axis in pandas mean?

- Axis 0 will act on all the ROWS in each COLUMN
    - axis=0 (or axis='index') means the operation moves down the index, row by row, and produces one result per column.
    - Ex: a mean on axis 0 will be the mean of all the rows in each column
    - axis = 0: by column = along the rows
    - axis 0 will act for each column on all the rows


- Axis 1 will act on all the COLUMNS in each ROW
    - axis=1 (or axis='columns') means the operation moves across the columns and produces one result per row.
    - Ex: a mean on axis 1 will be a mean of all the columns in each row.
    - axis = 1: by row = along the columns
    - axis 1 will act for each row on all the columns
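
A small worked example can make the two axes concrete. The DataFrame below is hypothetical, with column names `q1` and `q2` and row labels `x`, `y`, `z` chosen only for illustration.

```python
import pandas as pd

# Hypothetical scores, invented for illustration only.
scores = pd.DataFrame(
    {"q1": [1, 2, 3], "q2": [10, 20, 30]},
    index=["x", "y", "z"],
)

# axis=0: collapse the rows, one mean per column.
col_means = scores.mean(axis=0)   # q1 -> 2.0, q2 -> 20.0

# axis=1: collapse the columns, one mean per row.
row_means = scores.mean(axis=1)   # x -> 5.5, y -> 11.0, z -> 16.5

print(col_means)
print(row_means)
```

The result of `axis=0` is indexed by the column names, and the result of `axis=1` is indexed by the row labels, which is a quick way to check which direction an operation ran.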

# Subset/Filter Dataframes

## Columns

### Return a dataframe

- **df[[col1, col2]]**
    - df[['team', 'wins']]

- **df[[col1]]**
    - df[['team']]
    
- my_cols = [col1, col2] -> df[my_cols]
    - Ex: my_cols = ['wins', 'losses'] -> df[my_cols].head()

### Return a series

- **df[col1]**
    - Ex: df['team']

- **df.col1**
    - Ex: df.team
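
The difference between single and double brackets can be checked directly. This sketch uses the same hypothetical `team`, `wins`, and `losses` columns as the examples in this section; the values are invented.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "team": ["Ants", "Bears", "Cats", "Dogs"],
    "wins": [10, 7, 7, 3],
    "losses": [2, 5, 5, 9],
})

as_frame = df[["team"]]     # a list of names inside the brackets -> DataFrame
as_series = df["team"]      # a single name -> Series
also_series = df.team       # attribute access -> the same Series

my_cols = ["wins", "losses"]
subset = df[my_cols]        # a list stored in a variable works the same way

print(type(as_frame).__name__, type(as_series).__name__)
print(subset.head())
```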

## Rows
- You can subset a dataframe by filtering rows using a conditional.
    - Ex: df[df.col1 < x] will return all columns and all rows where col1 value is less than x
    
## Subset Columns and Filter Rows

- **df[df.col1 < x].col2**: column 2 and rows where col1 value is less than x. What kind of object is returned? *Series*

- **df[df.col1 < x][[col1, col2]]**: columns 1 & 2 and rows where col1 value is less than x. What kind of object is returned? *DataFrame*
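
The filter-then-subset patterns above can be sketched as follows. The data and the threshold `x = 8` are hypothetical. The last line shows `.loc` doing the row filter and column selection in a single step, offered here as an equivalent alternative rather than a required style.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "team": ["Ants", "Bears", "Cats", "Dogs"],
    "wins": [10, 7, 7, 3],
    "losses": [2, 5, 5, 9],
})

x = 8  # hypothetical threshold

rows_only = df[df.wins < x]                    # DataFrame: all columns, filtered rows
one_col = df[df.wins < x].losses               # Series: one column, filtered rows
two_cols = df[df.wins < x][["team", "wins"]]   # DataFrame: two columns, filtered rows

# The same selection as two_cols, written as one .loc call.
same_with_loc = df.loc[df.wins < x, ["team", "wins"]]

print(rows_only)
print(one_col)
print(two_cols)
```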