<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Introduction to Pandas


*Instructor: Aymeric Flaisler*
___

Pandas is the most popular Python package for managing data sets. It's used extensively by data scientists.

### Learning Objectives

- Define the anatomy of DataFrames.
- Explore data with DataFrames.
- Practice plotting with pandas.

### Lesson Guide

- [Introduction to `pandas`](#introduction)
- [Loading CSV Files](#loading_csvs)
- [Exploring Your Data](#exploring_data)
- [Data Dimensions](#data_dimensions)
- [DataFrames vs. Series](#dataframe_series)
- [Using the `.info()` Function](#info)
- [Using the `.describe()` Function](#describe)
- [Independent Practice](#independent_practice)
- [Pandas Indexing](#indexing)
- [Creating DataFrames](#creating_dataframes)
- [Checking Data Types](#dtypes)
- [Renaming and Assignment](#renaming_assignment)
- [Basic `pandas` Plotting](#basic_plotting)
- [Logical Filtering](#filtering)
- [Review](#review)

<a id='introduction'></a>

### What is a dataframe?

---
The concept of a "dataframe" comes from the world of statistical software used in empirical research:
- It generally refers to "tabular" data: a data structure representing cases (rows), each of which consists of a number of observations or measurements (columns).
- **Each row** is treated as a **single observation** of **multiple "variables"**.
- The row ("record") datatype can be **heterogeneous** (a tuple of different types).
- The column datatype must be **homogeneous**.
- Data frames usually contain some **metadata** in addition to **data**, for example, column and row names (which plain NumPy arrays do not carry by default).
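
The points above can be sketched with a tiny, made-up table. The column names and values here are hypothetical and only serve to show homogeneous columns, heterogeneous rows, and the metadata that travels with the data.

```python
import pandas as pd

# Hypothetical toy data: each row is one observation of three variables.
people = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],   # text column
    "age": [36, 45, 41],                # integer column
    "height_m": [1.65, 1.70, 1.78],     # float column
})

# Each column holds a single type...
print(people.dtypes)

# ...while a single row mixes types across its fields.
first_row = people.iloc[0]
print([type(value).__name__ for value in first_row])

# Metadata: column names and row labels travel with the data.
print(list(people.columns), list(people.index))
```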

<a id='introduction'></a>

### What is `pandas`?

---

- A data analysis library. The name derives from "panel data" (**P**anel **D**ata **S**ystem).
- It was created by Wes McKinney and open sourced by AQR Capital Management, LLC in 2009.
- It's implemented in highly optimized Python/Cython.
- It's the **most ubiquitous tool** used to start data analysis projects within the Python scientific ecosystem.


### Pandas Use Cases

---

- Cleaning data/munging.
- Exploratory analysis.
- Structuring data for plots or tabular display.
- Joining disparate sources.
- Modeling.
- Filtering, extracting, or transforming. 


![](https://snag.gy/tpiLCH.jpg)

![](https://snag.gy/1V0Ol4.jpg)

### Common Outputs

---

With `pandas` you can:

- Export to databases
- Integrate with `matplotlib`
- Collaborate in common file formats such as CSV and Excel (plus a variety of others)
- Integrate with Python built-ins (**and `numpy`!**)
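
As a rough illustration of a few of these outputs, the sketch below converts a small, made-up DataFrame to CSV text, to plain Python dictionaries, and to a NumPy array. The `scores` data is hypothetical.

```python
import pandas as pd

# Hypothetical data used only for this illustration.
scores = pd.DataFrame({"student": ["a", "b"], "score": [90, 85]})

csv_text = scores.to_csv(index=False)        # CSV as a string (no path given)
records = scores.to_dict(orient="records")   # list of plain Python dicts
array = scores["score"].to_numpy()           # NumPy array of the column values

print(csv_text)
print(records)
print(array)
```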


### Importing `pandas`

---

Import `pandas` at the top of your notebook like so:

In [2]:
import pandas as pd
import numpy as np

Recall that the **`import pandas as pd`** syntax nicknames the `pandas` module as **`pd`** for convenience.

<a id='loading_csvs'></a>

### Loading a CSV into a DataFrame

---

`pandas` can load many types of files, but one of the most commonly used for storing data is a ```.csv```. As an example, let's load a data set on drug use by age from the ```./datasets``` directory:

In [3]:
drug = pd.read_csv('./datasets/drug-use-by-age.csv')

This creates a `pandas` object called a **DataFrame**. DataFrames are powerful containers, featuring many built-in functions for exploring and manipulating data.

We will barely scratch the surface of DataFrame functionality in this lesson, but, throughout this course, you will become an expert at using them.

In short, a DataFrame is a supercharged 2D array:

- It has the data.
- It has information about the data (metadata such as column names).
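
If the data set file is not at hand, the same loading step can be tried on in-memory text. This is a minimal sketch: the CSV text is a hypothetical three-row excerpt, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """age,n,alcohol-use
12,2798,3.9
13,2757,8.5
14,2792,18.1
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (rows, columns)
print(list(sample.columns))   # column names taken from the header row
```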

<a id='exploring_data'></a>

### Exploring Data using DataFrames

---

DataFrames come with built-in functionality that makes data exploration easy. 

To start, let's look at the **"header"** of your data using the `.head()` function. Run alone in a notebook cell, it shows the first five rows by default, along with the first handful of columns. Passing an integer, as in `.head(10)`, changes the number of rows shown.

In [4]:
drug.head(10)

Unnamed: 0,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
0,12,2798,3.9,3.0,1.1,4.0,0.1,5.0,0.0,-,...,0.1,24.5,0.2,52.0,0.2,2.0,0.0,-,0.2,13.0
1,13,2757,8.5,6.0,3.4,15.0,0.1,1.0,0.0,3.0,...,0.1,41.0,0.3,25.5,0.3,4.0,0.1,5.0,0.1,19.0
2,14,2792,18.1,5.0,8.7,24.0,0.1,5.5,0.0,-,...,0.4,4.5,0.9,5.0,0.8,12.0,0.1,24.0,0.2,16.5
3,15,2956,29.2,6.0,14.5,25.0,0.5,4.0,0.1,9.5,...,0.8,3.0,2.0,4.5,1.5,6.0,0.3,10.5,0.4,30.0
4,16,3058,40.1,10.0,22.5,30.0,1.0,7.0,0.0,1.0,...,1.1,4.0,2.4,11.0,1.8,9.5,0.3,36.0,0.2,3.0
5,17,3038,49.3,13.0,28.0,36.0,2.0,5.0,0.1,21.0,...,1.4,6.0,3.5,7.0,2.8,9.0,0.6,48.0,0.5,6.5
6,18,2469,58.7,24.0,33.7,52.0,3.2,5.0,0.4,10.0,...,1.7,7.0,4.9,12.0,3.0,8.0,0.5,12.0,0.4,10.0
7,19,2223,64.6,36.0,33.4,60.0,4.1,5.5,0.5,2.0,...,1.5,7.5,4.2,4.5,3.3,6.0,0.4,105.0,0.3,6.0
8,20,2271,69.7,48.0,34.0,60.0,4.9,8.0,0.6,5.0,...,1.7,12.0,5.4,10.0,4.0,12.0,0.9,12.0,0.5,4.0
9,21,2354,83.2,52.0,33.0,52.0,4.8,5.0,0.5,17.0,...,1.3,13.5,3.9,7.0,4.1,10.0,0.6,2.0,0.3,9.0


If we want to see the last part of our data, we can use the `.tail()` function in the same way.

In [5]:
drug.tail()

Unnamed: 0,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
12,26-29,2628,80.7,52.0,20.8,52.0,3.2,5.0,0.4,6.0,...,1.2,13.5,4.2,10.0,2.3,7.0,0.6,30.0,0.4,4.0
13,30-34,2864,77.5,52.0,16.4,72.0,2.1,8.0,0.5,15.0,...,0.9,46.0,3.6,8.0,1.4,12.0,0.4,54.0,0.4,10.0
14,35-49,7391,75.0,52.0,10.4,48.0,1.5,15.0,0.5,48.0,...,0.3,12.0,1.9,6.0,0.6,24.0,0.2,104.0,0.3,10.0
15,50-64,3923,67.2,52.0,7.3,52.0,0.9,36.0,0.4,62.0,...,0.4,5.0,1.4,10.0,0.3,24.0,0.2,30.0,0.2,104.0
16,65+,2448,49.3,52.0,1.2,36.0,0.0,-,0.0,-,...,0.0,-,0.2,5.0,0.0,364.0,0.0,-,0.0,15.0
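
Both functions take an optional row count. A small sketch on a made-up 20-row DataFrame (the `numbers` name is hypothetical):

```python
import pandas as pd

# Hypothetical 20-row DataFrame with values 0 through 19.
numbers = pd.DataFrame({"value": range(20)})

first_five = numbers.head()     # default: first 5 rows
first_three = numbers.head(3)   # first 3 rows
last_two = numbers.tail(2)      # last 2 rows

print(last_two)
```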


<a id='data_dimensions'></a>

### Data Dimensions

---

It's always good to look at the dimensions of your data. The ```.shape``` property will tell you how many rows and columns are contained within your DataFrame.

In [6]:
drug.shape

(17, 28)

As you can see, we have 17 rows and 28 columns, so we can consider this a small data set.

You'll also notice that this property works the same way as `.shape` for `numpy` arrays/matrices. **`pandas` makes use of `numpy` under the hood** for optimization and speed.
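
To see the parallel with `numpy`, this sketch builds a small DataFrame from an array and compares the two shapes. The `grid` name and its contents are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 table of zeros.
grid = pd.DataFrame(np.zeros((4, 3)), columns=["a", "b", "c"])

rows, cols = grid.shape      # a plain (rows, columns) tuple
as_array = grid.to_numpy()   # the underlying values as a NumPy array

print(rows, cols, as_array.shape)
```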

You can look up the names of your columns using the ```.columns``` property.


In [7]:
drug.columns

Index(['age', 'n', 'alcohol-use', 'alcohol-frequency', 'marijuana-use',
       'marijuana-frequency', 'cocaine-use', 'cocaine-frequency', 'crack-use',
       'crack-frequency', 'heroin-use', 'heroin-frequency', 'hallucinogen-use',
       'hallucinogen-frequency', 'inhalant-use', 'inhalant-frequency',
       'pain-releiver-use', 'pain-releiver-frequency', 'oxycontin-use',
       'oxycontin-frequency', 'tranquilizer-use', 'tranquilizer-frequency',
       'stimulant-use', 'stimulant-frequency', 'meth-use', 'meth-frequency',
       'sedative-use', 'sedative-frequency'],
      dtype='object')

Accessing a specific column is easy. You can use **bracket** syntax just like you would with **Python dictionaries**, using the column's string name to extract it.

In [8]:
drug['crack-use'].head()

0    0.0
1    0.0
2    0.0
3    0.1
4    0.0
Name: crack-use, dtype: float64

As you can see, we can also use the ```.head()``` function on a single column, which is represented as a `pandas` Series object.

With a **list of strings**, you can also access a column (as a DataFrame instead of a Series).

In [9]:
drug[['crack-use']].head()

Unnamed: 0,crack-use
0,0.0
1,0.0
2,0.0
3,0.1
4,0.0


In [10]:
drug[['age','crack-use']].head()

Unnamed: 0,age,crack-use
0,12,0.0
1,13,0.0
2,14,0.0
3,15,0.1
4,16,0.0


<a id='dataframe_series'></a>

### DataFrame vs. Series

---

There is an important difference between using a list of strings versus only using a string with a column's name: When you use a list containing the string, it returns another **DataFrame**. But, when you only use the string, it returns a `pandas` **Series** object.

In [11]:
print(type(drug['age']))

<class 'pandas.core.series.Series'>


In [12]:
print(type(drug[['age']]))

<class 'pandas.core.frame.DataFrame'>


**Breakout (2min):** What's the difference between `pandas` Series and DataFrame objects?

As long as your column names **don't contain any spaces** or other special characters (underscores are OK), you can also access a column as a property of a DataFrame, as in `drug.age`.

**Get in the habit of referencing your Series columns using `df['my_column']` rather than with object notation (`df.my_column`)**. There are many edge cases in which the object notation does not work, along with nuances as to how `pandas` will behave.
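
Two of those edge cases can be sketched with a made-up DataFrame: a column whose name collides with a DataFrame method, and a column whose name is not a valid Python identifier. The `usage` data is hypothetical.

```python
import pandas as pd

# Hypothetical data: "count" collides with DataFrame.count,
# and "crack-use" contains a hyphen.
usage = pd.DataFrame({"count": [10, 20], "crack-use": [0.0, 0.1]})

# Bracket access always returns the column.
count_column = usage["count"]

# Attribute access finds the DataFrame.count method instead of the column.
count_attribute = usage.count

# usage.crack-use would be parsed as `usage.crack - use`,
# so brackets are the only option for hyphenated names.
crack_column = usage["crack-use"]

print(type(count_column).__name__, callable(count_attribute))
```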

In [13]:
drug['age'].head()

0    12
1    13
2    14
3    15
4    16
Name: age, dtype: object

Remember: This will be a **Series** object, not a **DataFrame**.

<a id='info'></a>

### Examining Your Data With `.info()`

---

When getting acquainted with a new data set, `.info()` should be **the first thing** you examine.

**Types** are very important. They affect the way data will be **represented** in our machine learning models, how data can be joined, whether or not math operators can be applied, and instances in which you can encounter unexpected results.
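
As a small illustration of why types matter, the sketch below reads a hypothetical excerpt in which `-` marks a missing value, much like some frequency columns in the drug data. Using `pd.to_numeric` with `errors="coerce"` is one possible way to recover numbers, not necessarily the approach this lesson uses later.

```python
import io
import pandas as pd

# Hypothetical excerpt: "-" marks a missing frequency.
csv_text = """age,crack-frequency
12,-
13,3.0
14,-
15,9.5
"""
raw = pd.read_csv(io.StringIO(csv_text))

# The "-" placeholder forces the whole column to be read as text.
is_numeric_before = pd.api.types.is_numeric_dtype(raw["crack-frequency"])

# errors="coerce" turns anything unparseable into NaN.
cleaned = pd.to_numeric(raw["crack-frequency"], errors="coerce")

print(is_numeric_before, cleaned.dtype)
```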

> _Typical problems that arise when working with new data sets include_:
> - Missing values.
> - Unexpected types (string/object instead of int/float).
> - Dirty data (commas, dollar signs, unexpected characters, etc.).
> - Blank values that are actually "non-null" or single white-space characters.

`.info()` is a function available on every **DataFrame** object. It provides information about:

- The name of the column/variable attribute.
- The type of index (RangeIndex is default).
- The count of non-null values by column/attribute.
- The type of data contained in the column/attribute.
- The unique counts of **dtypes** (`pandas` data types).
- The memory usage of our data set.
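
By default `.info()` prints its report rather than returning it. The sketch below, on a made-up two-column DataFrame, captures the report through the `buf` parameter and compares it with the non-null counts from `.count()`.

```python
import io
import pandas as pd

# Hypothetical data with one missing label.
frame = pd.DataFrame({"label": ["x", None, "z"], "value": [1.0, 2.0, 3.0]})

# info() prints to stdout by default; buf redirects it to a text buffer.
buffer = io.StringIO()
frame.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same non-null counts that info() reports, as a Series.
non_null_counts = frame.count()
```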

#### For example: 

In [14]:
drug.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 28 columns):
age                        17 non-null object
n                          17 non-null int64
alcohol-use                17 non-null float64
alcohol-frequency          17 non-null float64
marijuana-use              17 non-null float64
marijuana-frequency        17 non-null float64
cocaine-use                17 non-null float64
cocaine-frequency          17 non-null object
crack-use                  17 non-null float64
crack-frequency            17 non-null object
heroin-use                 17 non-null float64
heroin-frequency           17 non-null object
hallucinogen-use           17 non-null float64
hallucinogen-frequency     17 non-null float64
inhalant-use               17 non-null float64
inhalant-frequency         17 non-null object
pain-releiver-use          17 non-null float64
pain-releiver-frequency    17 non-null float64
oxycontin-use              17 non-null float64
oxycontin-f

<a id='describe'></a>

### Summarizing Data with `.describe()`

---

The ```.describe()``` function is useful for taking a quick look at your data. It returns some basic descriptive statistics.

For our example, use the ```.describe()``` function on only the ```crack-use``` column.

In [15]:
drug['crack-use'].describe()

count    17.000000
mean      0.294118
std       0.235772
min       0.000000
25%       0.000000
50%       0.400000
75%       0.500000
max       0.600000
Name: crack-use, dtype: float64

You can also call it on the whole DataFrame to summarize every numeric column at once.

In [16]:
drug.describe()

Unnamed: 0,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,crack-use,heroin-use,hallucinogen-use,hallucinogen-frequency,...,pain-releiver-use,pain-releiver-frequency,oxycontin-use,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,sedative-use,sedative-frequency
count,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,...,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0
mean,3251.058824,55.429412,33.352941,18.923529,42.941176,2.176471,0.294118,0.352941,3.394118,8.411765,...,6.270588,14.705882,0.935294,2.805882,11.735294,1.917647,31.147059,0.382353,0.282353,19.382353
std,1297.890426,26.878866,21.318833,11.959752,18.362566,1.816772,0.235772,0.333762,2.792506,15.000245,...,3.166379,6.935098,0.608216,1.753379,11.485205,1.407673,85.97379,0.262762,0.138,24.833527
min,2223.0,3.9,3.0,1.1,4.0,0.0,0.0,0.0,0.1,2.0,...,0.6,7.0,0.0,0.2,4.5,0.0,2.0,0.0,0.0,3.0
25%,2469.0,40.1,10.0,8.7,30.0,0.5,0.0,0.1,0.6,3.0,...,3.9,12.0,0.4,1.4,6.0,0.6,7.0,0.2,0.2,6.5
50%,2798.0,64.6,48.0,20.8,52.0,2.0,0.4,0.2,3.2,3.0,...,6.2,12.0,1.1,3.5,10.0,1.8,10.0,0.4,0.3,10.0
75%,3058.0,77.5,52.0,28.4,52.0,4.0,0.5,0.6,5.2,4.0,...,9.0,15.0,1.4,4.2,11.0,3.0,12.0,0.6,0.4,17.5
max,7391.0,84.2,52.0,34.0,72.0,4.9,0.6,1.1,8.6,52.0,...,10.0,36.0,1.7,5.4,52.0,4.1,364.0,0.9,0.5,104.0


```.describe()``` gives us the following statistics:

- **Count**, which is the number of non-null values in the column.
- **Mean**, or, the average of the values in the column.
- **Std**, which is the standard deviation.
- **Min**, a.k.a., the minimum value.
- **25%**, or, the 25th percentile of the values.
- **50%**, or, the 50th percentile of the values (which is equivalent to the median).
- **75%**, or, the 75th percentile of the values.
- **Max**, which is the maximum value.
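
A quick check of two of these statistics on a made-up Series with one missing value: the count skips the missing entry, and the 50% row matches the median.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
values = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
summary = values.describe()

print(summary)
print(len(values), summary["count"])       # total entries vs. non-null count
print(summary["50%"] == values.median())   # 50th percentile is the median
```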

<img src="https://snag.gy/AH6E8I.jpg">

There are also built-in math functions that will work on all columns of a DataFrame at once, as well as subsets of the data.

For example, we can use the `.mean()` function on the `drug` DataFrame to get the mean for every numeric column.

In [17]:
# numeric_only=True skips text columns such as age, which have no mean
pd.DataFrame(drug.mean(numeric_only=True), columns=['mean'])

Unnamed: 0,mean
n,3251.058824
alcohol-use,55.429412
alcohol-frequency,33.352941
marijuana-use,18.923529
marijuana-frequency,42.941176
cocaine-use,2.176471
crack-use,0.294118
heroin-use,0.352941
hallucinogen-use,3.394118
hallucinogen-frequency,8.411765
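
The same function works on a subset of columns and along rows. A minimal sketch on a hypothetical three-row excerpt (the values are copied from the table above, but the `sample` DataFrame itself is made up for the example):

```python
import pandas as pd

sample = pd.DataFrame({
    "age": ["12", "13", "26-29"],        # text, so it has no mean
    "alcohol-use": [3.9, 8.5, 80.7],
    "marijuana-use": [1.1, 3.4, 20.8],
})

use_columns = ["alcohol-use", "marijuana-use"]

all_means = sample.mean(numeric_only=True)   # skips the text column
use_means = sample[use_columns].mean()       # mean of each selected column
row_means = sample[use_columns].mean(axis=1) # mean across columns, per row

print(all_means)
print(row_means)
```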


<a id='independent_practice'></a>

### Independent Practice

---

Now that we know a little bit about basic DataFrame use, let's practice on a new data set.

> Pro tip: When your cursor is in a string, you can use the "tab" key to browse file system resources and get a relative reference for the files that can be loaded in Jupyter notebook. Remember, you have to use your arrow keys to navigate the files populated in the UI. 

<img src="https://snag.gy/IlLNm9.jpg">

1. Find and load the `diamonds` data set into a DataFrame (in the `datasets` directory).
2. Print out the columns.
3. What does the data set look like in terms of dimensions?
4. Check the types of each column.
  a. What is the most common type?
  b. How many entries are there?
  c. How much memory does this data set consume?
5. Examine the summary statistics of the data set.

In [18]:
csv_file = "./datasets/diamonds/diamonds.csv"
diamonds = pd.read_csv(csv_file)

In [20]:
# column names
diamonds.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

In [None]:
# shape
diamonds.shape

In [None]:
# column data types, number of entries, and memory used
diamonds.info()

In [None]:
# summary stats 
diamonds.describe()

<a id='indexing'></a>

### `pandas` Indexing 

---

More often than not, we want to operate on or extract specific portions of our data. When we perform indexing on a DataFrame or Series, we can specify a certain section of the data.

`pandas` has two main properties you can use for indexing:

- **`.loc`** indexes with the _labels_ for rows and columns.
- **`.iloc`** indexes with the _integer positions_ for rows and columns.

To help clarify these differences, let's first reset the row labels to letters using the ```.index``` attribute:

In [21]:
new_index_values = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q']
drug.index=new_index_values


In [22]:
drug.head()

Unnamed: 0,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
A,12,2798,3.9,3.0,1.1,4.0,0.1,5.0,0.0,-,...,0.1,24.5,0.2,52.0,0.2,2.0,0.0,-,0.2,13.0
B,13,2757,8.5,6.0,3.4,15.0,0.1,1.0,0.0,3.0,...,0.1,41.0,0.3,25.5,0.3,4.0,0.1,5.0,0.1,19.0
C,14,2792,18.1,5.0,8.7,24.0,0.1,5.5,0.0,-,...,0.4,4.5,0.9,5.0,0.8,12.0,0.1,24.0,0.2,16.5
D,15,2956,29.2,6.0,14.5,25.0,0.5,4.0,0.1,9.5,...,0.8,3.0,2.0,4.5,1.5,6.0,0.3,10.5,0.4,30.0
E,16,3058,40.1,10.0,22.5,30.0,1.0,7.0,0.0,1.0,...,1.1,4.0,2.4,11.0,1.8,9.5,0.3,36.0,0.2,3.0


Using the **`.loc`** indexer, we can pull out rows **B through F** and the **`marijuana-use` and `marijuana-frequency`** columns.

In [23]:
subset = drug.loc[['B','C','D','E','F'], ['marijuana-use','marijuana-frequency']]

In [24]:
subset

Unnamed: 0,marijuana-use,marijuana-frequency
B,3.4,15.0
C,8.7,24.0
D,14.5,25.0
E,22.5,30.0
F,28.0,36.0


We can do the same thing with the **`.iloc`** indexer, but we have to use integers for the location.

In [25]:
subset = drug.iloc[[1,2,3,4,5], [4,5]]

In [26]:
subset

Unnamed: 0,marijuana-use,marijuana-frequency
B,3.4,15.0
C,8.7,24.0
D,14.5,25.0
E,22.5,30.0
F,28.0,36.0


Because the row and column labels are now strings, trying to index the rows or columns with integers using **`.loc`** raises an error.

Note that you can automatically reorder the data just by reordering the indices you enter when you perform the indexing operation!
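
The following sketch, on a made-up three-row DataFrame with letter labels, shows reordering through the index lists, the difference between label slices and position slices, and the error raised by an integer passed to `.loc`. The `letters` data is hypothetical.

```python
import pandas as pd

letters = pd.DataFrame(
    {"use": [1.1, 3.4, 8.7], "frequency": [4.0, 15.0, 24.0]},
    index=["A", "B", "C"],
)

# Listing labels in a different order reorders the result.
reordered = letters.loc[["C", "A"], ["frequency", "use"]]

# Label slices with .loc include the end label; .iloc slices exclude the end.
by_label = letters.loc["A":"B"]    # rows A and B
by_position = letters.iloc[0:1]    # row A only

# With string row labels, an integer is not a valid label for .loc.
try:
    letters.loc[0]
    integer_label_failed = False
except KeyError:
    integer_label_failed = True

print(reordered)
print(integer_label_failed)
```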

While we created an index earlier, we can also use a column to set an index.

In [27]:
drug.index = drug['age']

drug.head()

age,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
12,12,2798,3.9,3.0,1.1,4.0,0.1,5.0,0.0,-,...,0.1,24.5,0.2,52.0,0.2,2.0,0.0,-,0.2,13.0
13,13,2757,8.5,6.0,3.4,15.0,0.1,1.0,0.0,3.0,...,0.1,41.0,0.3,25.5,0.3,4.0,0.1,5.0,0.1,19.0
14,14,2792,18.1,5.0,8.7,24.0,0.1,5.5,0.0,-,...,0.4,4.5,0.9,5.0,0.8,12.0,0.1,24.0,0.2,16.5
15,15,2956,29.2,6.0,14.5,25.0,0.5,4.0,0.1,9.5,...,0.8,3.0,2.0,4.5,1.5,6.0,0.3,10.5,0.4,30.0
16,16,3058,40.1,10.0,22.5,30.0,1.0,7.0,0.0,1.0,...,1.1,4.0,2.4,11.0,1.8,9.5,0.3,36.0,0.2,3.0


Is age the best feature to use as an index?  

If it isn't, we can use `df.reset_index()` to reset our index.

In [None]:
drug.reset_index(drop=True, inplace=True)
drug.head()
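
An alternative to assigning to `.index` directly is the `set_index()` method, which moves a column into the index and returns a new DataFrame. The sketch below uses a hypothetical three-row `ages` DataFrame and shows how `drop` changes what `reset_index()` does with the old index.

```python
import pandas as pd

# Hypothetical excerpt of the drug data.
ages = pd.DataFrame({"age": ["12", "13", "14"], "n": [2798, 2757, 2792]})

# set_index moves the column into the index (returns a new DataFrame).
by_age = ages.set_index("age")

# reset_index() moves the index back into a column;
# drop=True discards it instead.
restored = by_age.reset_index()
dropped = by_age.reset_index(drop=True)

print(by_age)
print(list(restored.columns), list(dropped.columns))
```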

<a id='creating_dataframes'></a>

### Creating DataFrames

---

The simplest way to create your own DataFrame without importing data from a file is to give the `pd.DataFrame()` constructor a dictionary.

In [28]:
mydata = pd.DataFrame({'Letters':['A','B','C'], 'Integers':[1,2,3], 'Floats':[2.2, 3.3, 4.4]})

In [29]:
mydata

Unnamed: 0,Letters,Integers,Floats
0,A,1,2.2
1,B,2,3.3
2,C,3,4.4


As you might expect, the dictionary needs to have lists of values that are all the same length. The keys correspond to the names of the columns, and the values correspond to the data in the columns.
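
To see the length requirement in action, this sketch passes lists of different lengths and catches the resulting `ValueError`. It also shows that a single scalar value, by contrast, is repeated for every row. The variable names are made up for the example.

```python
import pandas as pd

# Lists of different lengths cannot be lined up into rows.
try:
    pd.DataFrame({"Letters": ["A", "B", "C"], "Integers": [1, 2]})
    mismatch_error = None
except ValueError as error:
    mismatch_error = error

# A scalar value is broadcast to every row.
filled = pd.DataFrame({"Letters": ["A", "B", "C"], "Flag": True})

print(mismatch_error)
print(filled)
```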

<a id='dtypes'></a>

### Examining Data Types

---

`pandas` comes with a useful property for looking solely at the data types of your DataFrame columns. Use ```.dtypes``` on your DataFrame:

In [30]:
mydata.dtypes

Letters      object
Integers      int64
Floats      float64
dtype: object

This will show you the data type of each column. Strings are stored as a type called "object," as they are not guaranteed to take up a set amount of space (strings can be any length).
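
Because `.dtypes` is itself a Series (column name mapped to type), it can be summarized like any other Series. A minimal sketch on a hypothetical all-numeric DataFrame, counting how many columns share each type:

```python
import pandas as pd

# Hypothetical numeric-only data.
mixed = pd.DataFrame({
    "Integers": [1, 2, 3],
    "Floats": [2.2, 3.3, 4.4],
    "MoreFloats": [0.5, 1.5, 2.5],
})

column_types = mixed.dtypes                 # Series: column name -> dtype
type_counts = mixed.dtypes.value_counts()   # number of columns per dtype

print(column_types)
print(type_counts)
```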

<a id='renaming_assignment'></a>

### Renaming and Assignment

---

`pandas` makes it easy to change column names and assign values to your DataFrame.

Say, for example, we want to change the column name `Integers` to `int`:

In [31]:
mydata.rename(columns={mydata.columns[1]:'int'}, inplace=True) # inplace = True updates mydata
print(mydata.columns)

Index(['Letters', 'int', 'Floats'], dtype='object')


In [None]:
mydata

If you want to change every column name, you can just assign a new list to the ```.columns``` property.

In [None]:
mydata.columns = ['A','B','C']
print(mydata.head())
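
Assignment works on values as well as names. The sketch below uses a separate, hypothetical `example` DataFrame to show that `rename` without `inplace=True` returns a new DataFrame, that assigning to a new column name adds a column, and that `.loc` can assign to a single cell.

```python
import pandas as pd

# Hypothetical data, kept separate from mydata above.
example = pd.DataFrame({"Letters": ["A", "B", "C"], "Integers": [1, 2, 3]})

# Without inplace=True, rename returns a new DataFrame and leaves example alone.
renamed = example.rename(columns={"Integers": "int"})

# Assigning to a new column name adds a column.
example["Doubled"] = example["Integers"] * 2

# .loc assigns to a single cell by row label and column name.
example.loc[0, "Integers"] = 100

print(renamed.columns)
print(example)
```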

<a id='basic_plotting'></a>

### Basic Plotting Using DataFrames

---

DataFrames also come with some basic convenience functions for plotting data. First, import `matplotlib` and set it to run "inline" in your notebook.

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

Using our `drug` DataFrame once again, use the `.plot()` function to plot the **`age`** column against the **`marijuana-use`** column.

In [None]:
drug.plot(x='age', y='marijuana-use')

The ```.hist()``` function will create a histogram for a column's values.

In [None]:
drug.hist('marijuana-use')

<a id='filtering'></a>

### Filtering Logic

---

One of the most powerful features of DataFrames is the ability to use logical commands to filter data.

Subset the ```drug``` data for only the rows in which `marijuana-use` is greater than 20.

In [None]:
drug[drug['marijuana-use'] > 20]

The ampersand (`&`) can be used to subset where multiple conditions need to be met for each row. Wrap each condition in parentheses, because `&` binds more tightly than the comparison operators.

Subset the data for `marijuana-use` greater than 20 like before, but now also require that `n` be greater than 4,000.

In [None]:
drug[(drug['marijuana-use'] > 20) & (drug['n'] > 4000)]
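
Under the hood, each comparison produces a boolean Series (a "mask") with one `True`/`False` per row, and masks combine with `&` (and), `|` (or), and `~` (not). A sketch on a hypothetical three-row excerpt:

```python
import pandas as pd

# Hypothetical excerpt of the drug data.
sample = pd.DataFrame({
    "age": ["17", "18", "35-49"],
    "n": [3038, 2469, 7391],
    "marijuana-use": [28.0, 33.7, 10.4],
})

# A comparison yields a boolean Series: one True/False per row.
mask = sample["marijuana-use"] > 20

both = sample[mask & (sample["n"] > 3000)]     # and
either = sample[mask | (sample["n"] > 3000)]   # or
neither = sample[~mask]                        # not

print(mask)
print(both)
```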

## Independent Practice

With our drug dataset already loaded, let's explore our dataset a bit more thoroughly to gain some familiarity with beginning exploratory analysis.

### 1.  Identify which variable distributions are skewed left or right.

### 2. Select only data for "marijuana-frequency" when "age" is "30-34".

### 2.A Can you select a range of values for age?  Why or why not?
e.g., age > 17 but < 21

### 3. Select only rows with index 5-10, for variables / columns "crack-use" and "crack-frequency"

### 4. Select the columns by numeric offset 2-5, rows with numeric index 3-7

### 5. Select a subset of data using 3 masked conditions.

e.g., "variable" > 20 and "variable 2" != 23 and "variable" > -2

<a id='review'></a>

### Review / Checkout (in pairs)

---

 - What should we do with a data set when we first acquire it?
 - What's important to consider when first looking at a data set? 
 - What are some common problems we can run into with new data?
 - What are some common operations we can run with DataFrames?
 - How do we slice? Index? Filter?