# Five-Step Process for Data Exploration

## Overview

A common mistake beginners make is writing too many lines of code in a single cell. While you are learning, you need feedback on every line of code that you write so that you can verify that it is correct. Only once you have verified the result should you move on to the next line of code.

To help you explore data more effectively in Jupyter Notebooks, I recommend the following five-step process:

1. Write and execute a single line of code to explore your data
1. Verify that this line of code works by inspecting the output
1. Assign the result to a variable
1. Within the same cell, in a second line, output the head of the DataFrame or Series
1. Continue to the next cell. Do not add more lines of code to the cell

### Apply to every part of the analysis
You can apply this process to every part of your data analysis. Let's see this process in action with a few examples. We will start by reading in the data.

In [1]:
import pandas as pd

### Step 1: Write and execute a single line of code to explore your data

In this step, we make a call to the `read_csv` function.

In [2]:
pd.read_csv('../data/bikes.csv')

,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.881050,-87.616970,11.0,Michigan Ave & Oak St,41.900960,-87.623777,15.0,73.9,10.0,12.7,-9999.00,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.883380,-87.641170,31.0,Wells St & Walton St,41.899930,-87.634430,19.0,69.1,10.0,6.9,-9999.00,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.881320,-87.629521,23.0,73.0,10.0,16.1,-9999.00,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.631890,31.0,72.0,10.0,16.1,-9999.00,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.00,partlycloudy
5,13595,Subscriber,Male,2013-07-01 12:37:00,2013-07-01 12:48:00,660,California Ave & 21st St,41.854016,-87.695445,15.0,Clark St & Wrightwood Ave,41.929546,-87.643118,15.0,73.0,10.0,17.3,-9999.00,mostlycloudy
6,18880,Subscriber,Male,2013-07-02 17:47:00,2013-07-02 17:56:00,565,Clark St & Randolph St,41.884576,-87.631890,31.0,Ravenswood Ave & Irving Park Rd,41.954690,-87.673930,19.0,66.0,10.0,15.0,-9999.00,cloudy
7,19689,Subscriber,Male,2013-07-03 09:07:00,2013-07-03 09:16:00,505,State St & Van Buren St,41.877181,-87.627844,27.0,Franklin St & Jackson Blvd,41.877708,-87.635321,27.0,64.0,7.0,5.8,-9999.00,cloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.883380,-87.641170,31.0,Wood St & Division St,41.903320,-87.672730,15.0,71.1,8.0,0.0,-9999.00,cloudy
9,23558,Subscriber,Female,2013-07-04 15:00:00,2013-07-04 15:16:00,922,Lakeview Ave & Fullerton Pkwy,41.925858,-87.638973,19.0,Racine Ave & Congress Pkwy,41.874640,-87.657030,19.0,81.0,10.0,12.7,-9999.00,mostlycloudy


### Step 2: Verify that this line of code works by inspecting the output

Looking above, the output appears to be correct. Of course, we can't inspect every single value, but we can do a sanity check to confirm that a reasonable-looking DataFrame is produced.
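
Beyond eyeballing the rendered table, a few quick attribute checks can support this sanity check. The sketch below is only an illustration: it reads a tiny hypothetical three-row sample (invented here as a stand-in for `../data/bikes.csv`, with just four of the columns shown above) and inspects its shape, column names, and one column's type. In a notebook, each check would normally sit in its own cell.

```python
import io

import pandas as pd

# Hypothetical stand-in for ../data/bikes.csv: three rows, four columns.
sample_csv = io.StringIO(
    "trip_id,usertype,gender,tripduration\n"
    "7147,Subscriber,Male,993\n"
    "7524,Subscriber,Male,623\n"
    "10927,Subscriber,Male,1040\n"
)
sample = pd.read_csv(sample_csv)

# Number of rows and columns: does it match what you expect from the file?
n_rows, n_cols = sample.shape

# Column names: were the headers parsed correctly?
column_names = list(sample.columns)

# Was tripduration read as a number rather than as text?
duration_is_numeric = pd.api.types.is_numeric_dtype(sample["tripduration"])

print(n_rows, n_cols)
print(column_names)
print(duration_is_numeric)
```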

### Step 3: Assign the result to a variable

You would normally do this step in the same cell, but for this demonstration, we will place it in the cell below.

In [3]:
bikes = pd.read_csv('../data/bikes.csv')

### Step 4: Within the same cell, in a second line, output the head of the DataFrame or Series

Again, in practice all of these steps would be combined in the same cell.

In [4]:
bikes.head()

,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


### Step 5: Continue to the next cell. Do not add more lines of code to the cell

It is tempting to do more analysis in a single cell. I advise against doing so when you are a beginner. By limiting your analysis to a single line per cell and outputting the result, you can easily trace your work from one step to the next. Most lines of code in a notebook apply some operation to the data, and it is vital that you can see exactly what each operation is doing. If you put multiple lines of code in a single cell, you lose track of what is happening and cannot easily verify that each operation is correct.

### More examples

Let's see another simple example of the five-step process for data exploration in the notebook. Instead of writing each of the five steps in its own cell, the final result is shown with an explanation that follows.

In [5]:
bikes = bikes.set_index('trip_id')
bikes.head()

trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


In this part of the analysis, we want to set one of the columns as the index. During step 1, we write a single line of code, `bikes.set_index('trip_id')`. In step 2, we manually verify that the output looks correct. In step 3, we assign the result to a variable with `bikes = bikes.set_index('trip_id')`. In step 4, we output the head as another line of code, and in step 5, we move on to the next cell.
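
Step 3 matters here because `set_index` returns a new DataFrame by default and leaves the original unchanged. A minimal sketch of that behavior follows, using a small hypothetical DataFrame (`demo` and its values are invented for illustration and are not the real bikes data):

```python
import pandas as pd

# Hypothetical miniature of the bikes data.
demo = pd.DataFrame(
    {"trip_id": [7147, 7524, 10927], "tripduration": [993, 623, 1040]}
)

# Calling set_index without assigning the result does not modify demo.
demo.set_index("trip_id")
columns_before = list(demo.columns)  # ['trip_id', 'tripduration']

# Step 3: assign the result to keep the change.
demo = demo.set_index("trip_id")
columns_after = list(demo.columns)  # ['tripduration']

print(columns_before)
print(columns_after)
print(demo.index.name)
```

This is why verifying the output in step 2 is not enough on its own: until the result is assigned, later cells still see the old DataFrame.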

### No strict requirement for one line of code
The above examples each had a single main line of code followed by outputting the head of the DataFrame. Often, there will be a few more very simple lines of code that can be written in the same cell. You should not strictly adhere to writing a single line of code; instead, aim to keep the amount of code written in a single cell to a minimum.

For instance, the following block has three lines of code. The first is very simple and creates a list of column names as strings. This is an instance where multiple lines of code are easily interpreted.

In [6]:
cols = ['gender', 'tripduration']
bikes_gt = bikes[cols]
bikes_gt.head()

trip_id,gender,tripduration
7147,Male,993
7524,Male,623
10927,Male,1040
12907,Male,667
13168,Male,130


### When to assign the result to a variable
Not every operation on our data needs to be assigned to a variable. Sometimes we are only interested in seeing the result. For many operations, however, you will want to continue working with the transformed data. Assigning the result to a variable gives you immediate access to it in later cells.
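
As a rough illustration of this distinction, the sketch below uses a small hypothetical DataFrame (the values are invented). A summary that you only want to look at, such as `value_counts`, can be left unassigned, while a filtered subset that you plan to keep working with is given a name (`long_trips` is a name chosen here purely for illustration).

```python
import pandas as pd

# Hypothetical miniature of the bikes data.
demo = pd.DataFrame(
    {
        "gender": ["Male", "Male", "Female", "Male"],
        "tripduration": [993, 623, 922, 130],
    }
)

# Only inspecting: in a notebook the result is displayed and then discarded.
print(demo["gender"].value_counts())

# Continuing the analysis with the transformed data: assign, then output the head.
long_trips = demo[demo["tripduration"] > 600]
print(long_trips.head())
```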

### When to create a new variable name
In the second example, the result was assigned back to `bikes`, overwriting the original. We did this because we no longer needed the original DataFrame. In the third example, we created an entirely new variable, `bikes_gt`, because we wanted to keep the `bikes` DataFrame. Creating new variables also makes it easier to trace the flow of work. Debugging is easier as well, since the result of each cell is preserved in its own variable (assuming we do not overwrite it in a later cell).
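
The two choices can be compared side by side in a short sketch. The DataFrame below is hypothetical (invented values, and `demo` and `demo_gt` are illustrative names mirroring `bikes` and `bikes_gt`):

```python
import pandas as pd

# Hypothetical miniature of the bikes data.
demo = pd.DataFrame(
    {
        "trip_id": [7147, 7524],
        "usertype": ["Subscriber", "Subscriber"],
        "gender": ["Male", "Male"],
        "tripduration": [993, 623],
    }
)

# Reusing the name: the un-indexed version is no longer reachable afterwards.
demo = demo.set_index("trip_id")

# New name: demo is preserved, and demo_gt holds this cell's result.
demo_gt = demo[["gender", "tripduration"]]

print(list(demo.columns))
print(list(demo_gt.columns))
```

Because `demo` still holds all of its columns after the last line, a later cell can go back to it if `demo_gt` turns out to be the wrong selection.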

### Continuously verifying results
Regardless of how adept you become at data exploration, it is good practice to verify each line of code. Data science is difficult, and it is easy to make mistakes. Data is also messy, so it pays to be skeptical as you proceed through an analysis. Getting visual verification that each line of code produces the desired result is important. It also provides feedback that helps you decide which avenues to explore next.
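
The `-9999` values in the `precipitation` column of the outputs above are one place where this skepticism applies. They look like a placeholder rather than a real measurement, although that reading is an assumption and is not confirmed by the data itself. The sketch below shows how summary statistics can surface such values; the `weather` DataFrame is hypothetical, with values chosen to echo the outputs above.

```python
import pandas as pd

# Hypothetical sample echoing the temperature and precipitation columns.
weather = pd.DataFrame(
    {
        "temperature": [73.9, 69.1, 73.0],
        "precipitation": [-9999.0, -9999.0, 0.0],
    }
)

# Summary statistics make implausible values easy to spot.
summary = weather.describe()
print(summary)

# A negative minimum is not a plausible precipitation reading.
min_precip = weather["precipitation"].min()
n_suspicious = (weather["precipitation"] < 0).sum()
print(min_precip, n_suspicious)
```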