# Explore Your Data
## Loading and viewing your data
In this chapter, you're going to look at a subset of the Department of Buildings Job Application Filings dataset from the [NYC Open Data](http://opendata.cityofnewyork.us/) portal. This dataset consists of job applications filed on January 22, 2017.

Your first task is to load this dataset into a DataFrame and then inspect it using the `.head()` and `.tail()` methods. You'll quickly find that the printed results don't show everything you need: there are too many columns, so pandas replaces the middle ones with `...`. You therefore need to look at the data in another way.
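One possible way to see every column in printed output, not used in this chapter, is to lift pandas' column display limit temporarily. The sketch below assumes a made-up 30-column DataFrame called `wide`; the `display.max_columns` option itself is a standard pandas setting.

```python
import pandas as pd

# Hypothetical wide DataFrame: 30 columns, 3 rows (not the real dataset)
wide = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

# With a small column limit, pandas elides the middle columns with "..."
with pd.option_context("display.max_columns", 5):
    truncated = repr(wide.head())

# With the limit removed, every column is printed (wrapped across lines if needed)
with pd.option_context("display.max_columns", None):
    full = repr(wide.head())

print(truncated)
print(full)
```

Using `pd.option_context` keeps the change local to the `with` block, so the rest of the session keeps its default display settings.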

The `.shape` and `.columns` attributes let you see the shape of the DataFrame and obtain a list of its columns. From here, you can see which columns are relevant to the questions you'd like to ask of the data. To this end, a new DataFrame, `df_subset`, consisting only of these relevant columns, has been pre-loaded. This is the DataFrame you'll work with in the rest of the chapter.
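As a minimal sketch of this workflow, the example below builds a two-row DataFrame from values visible in the printed head of `df`, inspects `.shape` and `.columns`, and then keeps only some columns. The names `toy`, `relevant`, and `toy_subset` are hypothetical, and the chosen columns are for illustration only; they are not necessarily the ones in `df_subset`.

```python
import pandas as pd

# Two rows copied from the printed head of df, with only four of its columns
toy = pd.DataFrame({
    "Job #": [121577873, 520129502],
    "Borough": ["MANHATTAN", "STATEN ISLAND"],
    "Block": [857, 342],
    "Lot": [38, 1],
})

# .shape is a (rows, columns) tuple; .columns holds the column labels
print(toy.shape)
print(toy.columns.tolist())

# Keep only the columns of interest (an illustrative choice)
relevant = ["Job #", "Borough"]
toy_subset = toy[relevant]
print(toy_subset.shape)
```

Selecting with a list of column names returns a new DataFrame, so `toy` itself is left unchanged.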

Get acquainted with the dataset now by exploring it with pandas! This initial exploratory analysis is a crucial first step of data cleaning.

**Instructions**
* Import `pandas` as `pd`.
* Read `'dob_job_application_filings_subset.csv'` ([data source](https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2/data)) into a DataFrame called `df`.
* Print the head and tail of `df`.
* Print the shape of `df` and its columns. Note: `.shape` and `.columns` are attributes, not methods, so you don't need to follow these with parentheses `()`.

In [8]:
# Import pandas
import pandas as pd

# Read the file into a DataFrame: df
df = pd.read_csv("dob_job_application_filings_subset.csv")
df_subset = pd.read_csv("dob_job_application_filings_subset_subset.csv")

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

# Print the head and tail of df_subset
print(df_subset.head())
print(df_subset.tail())

       Job #  Doc #        Borough       House #  \
0  121577873      2      MANHATTAN  386            
1  520129502      1  STATEN ISLAND  107            
2  121601560      1      MANHATTAN  63             
3  121601203      1      MANHATTAN  48             
4  121601338      1      MANHATTAN  45             

                        Street Name  Block  Lot    Bin # Job Type Job Status  \
0  PARK AVENUE SOUTH                   857   38  1016890       A2          D   
1  KNOX PLACE                          342    1  5161350       A3          A   
2  WEST 131 STREET                    1729    9  1053831       A2          Q   
3  WEST 25TH STREET                    826   69  1015610       A2          D   
4  WEST 29 STREET                      831    7  1015754       A3          D   

            ...                         Owner's Last Name  \
0           ...            MIGLIORE                         
1           ...            BLUMENBERG                       
2           ...        



## Further diagnosis
In the previous exercise, you identified some potentially unclean or missing data. Now, you'll continue to diagnose your data with the very useful `.info()` method.

The `.info()` method provides important information about a DataFrame, such as the number of rows, number of columns, number of non-missing values in each column, and the data type stored in each column. This is the kind of information that will allow you to confirm whether the `'Initial Cost'` and `'Total Est. Fee'` columns are numeric or strings. From the results, you'll also be able to see whether or not all columns have complete data in them.
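To make the `.info()` report concrete, here is a small sketch on a made-up three-row DataFrame. The column names are borrowed from the dataset, but `toy` and all of its values are hypothetical. The report is captured in a buffer only so that it can be inspected as a string; calling `toy.info()` on its own prints the same text.

```python
import io

import pandas as pd

# Hypothetical data: a cost column stored as text and a column with missing values
toy = pd.DataFrame({
    "Job #": [1, 2, 3],
    "Initial Cost": ["$75000.00", "$0.00", "$30000.00"],
    "Landmarked": ["Y", None, None],
})

# Write the .info() report into a string buffer instead of stdout
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same facts can be checked programmatically
print(pd.api.types.is_numeric_dtype(toy["Initial Cost"]))  # is the cost column numeric?
print(toy["Landmarked"].notna().sum())                     # count of non-missing values
```

In the report, a column whose non-null count is lower than the number of entries has missing data, and a cost column listed with a non-numeric dtype is being stored as strings.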

The full DataFrame `df` and the subset DataFrame `df_subset` have been pre-loaded. Your task is to use the `.info()` method on these and analyze the results.

**Instructions**
* Print the info of `df`.
* Print the info of the subset DataFrame, `df_subset`.

In [9]:
# Print the info of df (.info() prints its report itself and returns None,
# so wrapping it in print() would add a stray "None" line)
df.info()

# Print the info of df_subset
df_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13280 entries, 0 to 13279
Data columns (total 82 columns):
Job #                           13280 non-null int64
Doc #                           13280 non-null int64
Borough                         13280 non-null object
House #                         13280 non-null object
Street Name                     13280 non-null object
Block                           13280 non-null int64
Lot                             13280 non-null int64
Bin #                           13280 non-null int64
Job Type                        13280 non-null object
Job Status                      13280 non-null object
Job Status Descrp               13280 non-null object
Latest Action Date              13280 non-null object
Building Type                   13280 non-null object
Community - Board               13280 non-null object
Cluster                         0 non-null float64
Landmarked                      2116 non-null object
Adult Estab                     1 no