## What Is Pandas?

Pandas is a Python library that primarily adds two new datatypes to Python: `DataFrame` and `Series`.

- A `Series` is a sequence of items, where each item has a label. Together, the labels form the `index`.
  - It generalizes a vector: vectors have integer indexes, while a `Series` can have any kind of index.
  - It is more structured and efficient than a dictionary: its elements are ordered and optimized for sequential processing.
- A `DataFrame` is a table of data. Each row has a label (the `row index`), and each column has a label (the `column index`). Labels are usually unique, although Pandas does not enforce this.
- Each column in a `DataFrame` can be considered a `Series` that shares the `DataFrame`'s row index.

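As a minimal sketch of the two datatypes described above, the following builds a small `Series` and `DataFrame` by hand. The names and values (`ages`, `people`, `alice`, and so on) are made up for illustration and are not part of the course data.

```python
import pandas as pd

# A Series: values plus a label (index entry) for each value.
ages = pd.Series([24, 53, 23], index=["alice", "bob", "carol"])

# Look up a value by its label rather than by integer position.
bob_age = ages["bob"]

# A DataFrame: a table with a row index and a column index.
people = pd.DataFrame(
    {"age": [24, 53, 23], "occupation": ["technician", "writer", "other"]},
    index=["alice", "bob", "carol"],
)

# Selecting one column returns a Series that shares the DataFrame's row index.
age_column = people["age"]
print(type(age_column).__name__)
print(list(age_column.index))
```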
> Behind the scenes, these datatypes use the NumPy ("Numerical Python") library. NumPy primarily adds the `ndarray` (n-dimensional array) datatype to Python. An `ndarray` is similar to a Python list in that it stores ordered data. However, it differs in three respects:
> - Each element has the same datatype (typically fixed-size, e.g., a 32-bit integer).
> - Elements are stored contiguously (immediately after each other) in memory for fast retrieval.
> - The total size of an `ndarray` is fixed.

> Storing `Series` and `DataFrame` data in `ndarray`s makes Pandas faster and more memory-efficient than standard Python datatypes. Many libraries (such as scikit-learn) accept `ndarray`s as input rather than Pandas datatypes, so we will frequently convert between them.

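The conversion between Pandas datatypes and `ndarray`s mentioned above might look like the sketch below. The `DataFrame` and its column names are hypothetical; `to_numpy()` is the conversion method assumed here.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"age": [24, 53, 23], "salary": [50000.0, 82000.0, 61000.0]})

# A single column converts to a one-dimensional ndarray.
ages = df["age"].to_numpy()
print(type(ages).__name__, ages.shape)

# The whole table converts to a two-dimensional ndarray. Because an ndarray has
# a single dtype, the integer column is upcast to float to match salary.
matrix = df.to_numpy()
print(matrix.dtype, matrix.shape)

# An ndarray can be wrapped back into a DataFrame by supplying column labels.
restored = pd.DataFrame(matrix, columns=df.columns)
```

Note that the round trip does not restore the original integer dtype of `age`, which is one reason to convert only the columns a downstream library needs.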

### Using Pandas

Pandas is frequently used in data science because it offers a large set of commonly used functions, is relatively fast, and has a large community. Because many data science libraries also use NumPy to manipulate data, you can easily transfer data between libraries (as we will often do in this class!).

Pandas also strongly favors certain usage patterns. For example, looping through a `DataFrame` row by row is discouraged. Instead, Pandas favors **vectorized functions** that operate on a whole column at once. (This is because each column is stored separately as an `ndarray`, and NumPy is optimized for operating on `ndarray`s.)

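To make the contrast concrete, here is a small sketch that computes the same result with a row-by-row loop and with a vectorized expression. The data is hypothetical and far too small to show a speed difference; the point is only the shape of the code.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"age": [24, 53, 23]})

# Discouraged: visit the rows one at a time in a Python loop.
months_loop = []
for _, row in df.iterrows():
    months_loop.append(row["age"] * 12)

# Preferred: one vectorized expression over the whole column.
months_vectorized = df["age"] * 12

# Both approaches produce the same values.
print(months_vectorized.tolist() == months_loop)
```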
Many workflows for exploratory data analysis (EDA) and machine learning (ML) begin by loading a data set into a `DataFrame` with:
  - One row per "observation"
  - One column per "attribute"

### Class Methods and Attributes

Pandas `DataFrame`s are instances of a Pandas class and therefore come with attributes and methods. To access these, follow the variable name with a dot. For example, given a `DataFrame` called `users`:

  - `users.index` -- accesses the `index` attribute. Note that there are no parentheses: attributes are not callable.
  - `users.head()` -- calls the `head` method (indicated by the opening and closing parentheses).
  - `users.head(10)` -- calls the `head` method with the argument `10`, requesting the first 10 rows. This is the same as:
  - `users.head(n=10)` -- calls the `head` method, setting the named parameter `n` to `10`.

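The calls listed above can be tried on a stand-in `DataFrame` before the real data is loaded. In this sketch, `users` is a hypothetical 20-row table, not the contents of the course data file.

```python
import pandas as pd

# Hypothetical stand-in for the `users` DataFrame loaded later in this notebook.
users = pd.DataFrame({"age": range(20, 40)})

# Attribute access: no parentheses.
row_labels = users.index
dimensions = users.shape       # (rows, columns)

# Method calls: parentheses, optionally with arguments.
first_five = users.head()      # head() defaults to n=5
first_ten = users.head(10)     # positional argument
also_ten = users.head(n=10)    # the same call with a named argument

print(first_ten.equals(also_ten))
```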
In [1]:
# Load Pandas into Python


In [3]:
# Use the read_table function to load the file 'user.tbl' as a data frame
# Store it in a variable users



In [None]:
# What is the type of this object?

In [None]:
# How does it print?

In [None]:
# What is its size and shape?

In [None]:
# Display some rows from the beginning and end of the data frame

In [None]:
# What are the row and column indexes?

In [None]:
# Use the info method to get a concise summary of the data frame
# Use the describe method to get a different summary

In [None]:
# Retrieve a column in two ways:  bracket notation and attribute notation
# What is the data type of a column?

In [None]:
# Do vector operations on a column:  min, max, mean

In [None]:
# Get a count of each distinct data value for the attribute gender

In [None]:
# Use a bar chart to show the counts for gender

In [None]:
# Use value_counts to get counts on a numeric attribute -- age

In [None]:
# Maybe a histogram would work better for numeric data

In [None]:
# Create a data frame containing only the users under the age of 20
#  (both verbosely, and in one line)

In [None]:
# Get the count by occupation of users under the age of 20

In [None]:
# Filter for all male users under the age of 20

In [None]:
# Filter for all users under 20 or over 60

In [None]:
# Create a data frame containing only columns age and occupation

In [None]:
# Select by occupation:  either 'doctor' or 'lawyer'

In [None]:
# Produce a sorted vector of user ages

In [None]:
# Sort the data frame by age
#   in ascending order

In [None]:
# Sort by occupation then by age within occupation

In [None]:
# Rename the column zip_code to be postal_code

In [None]:
# Create a new column salary; populate it with random numbers 
#   normally distributed with a mean of 100000 and a standard deviation of 1000

In [None]:
# Create the column dollars_per_year which is salary divided by age

In [None]:
# Drop the column dollars_per_year

In [None]:
# Compute average salary by occupation

In [None]:
# Create a new column age_in_months 

In [None]:
# Encode gender in a new column is_male: 1 for male, 0 for female
users['is_male'] = users.gender.map({'F':0, 'M':1})

In [None]:
# Replace the occupation 'other' with 'unknown'

In [256]:
# Change the data type of zip_code to integer


In [None]:
# Create a new data frame with a random sample of 50% of the rows in the original

In [None]:
# Write the new data frame to a file user_sample.csv --
#  make it comma delimited, and no header line