# Pandas Essentials: Data Wrangling

The notebook exercises below provide practice in basic data wrangling. We focus on the NYC Pizza Inspection Data Set.

In [1]:
import pandas as pd

# Set max display columns and rows
pd.options.display.max_columns = 10
pd.options.display.max_rows = 10

# Loading the "Corrupted" NYC Pizza Inspection Data Set

In the cell below, load the "NYC_Pizza_2017_corrupted.csv" data set into a new data frame called `pizza_df`.  This is a file that I have manually corrupted to illustrate a number of key concepts.

In [2]:
pizza_df = pd.read_csv("../data/NYC_Pizza_2017_corrupted.csv")

In the cell below, take a peek at the `pizza_df` data frame.

In [3]:
pizza_df.head()

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,CUISINE DESCRIPTION,SCORE,GRADE,GRADE DATE
0,40363644.0,DOMINO'S,MANHATTAN,464.0,3 AVENUE,10016.0,Pizza,4.0,A,2017-03-30
1,40363945.0,DOMINO'S,MANHATTAN,148.0,WEST 72 STREET,10023.0,Pizza,12.0,A,2017-03-02
2,40364920.0,RIZZO'S FINE PIZZA,QUEENS,3013.0,STEINWAY STREET,11103.0,Pizza,12.0,A,2016-11-03
3,,,,,,,,,,
4,40365280.0,COMO PIZZA,MANHATTAN,4035.0,BROADWAY,10032.0,Pizza,10.0,A,2016-08-29


In the cell below, output the data types associated with all columns in the `pizza_df` data frame.

In [4]:
pizza_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1151 entries, 0 to 1150
Data columns (total 10 columns):
CAMIS                  1149 non-null float64
DBA                    1149 non-null object
BORO                   1147 non-null object
BUILDING               1149 non-null object
STREET                 1149 non-null object
ZIPCODE                1149 non-null float64
CUISINE DESCRIPTION    1149 non-null object
SCORE                  1147 non-null float64
GRADE                  1147 non-null object
GRADE DATE             1149 non-null object
dtypes: float64(3), object(7)
memory usage: 90.0+ KB
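The `info()` output above reports CAMIS and ZIPCODE as `float64` even though they look like integers. A likely reason is that the file contains blank rows: pandas stores missing numbers as `NaN`, which is a float, so the whole column is read as float. The sketch below reproduces this with a small, hypothetical CSV (the column names `ID` and `NAME` are made up for illustration).

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the second data row has no values at all.
csv_text = "ID,NAME\n101,A\n,\n103,C\n"
toy_df = pd.read_csv(io.StringIO(csv_text))

# ID holds integers in the file, but the missing value forces float64.
print(toy_df.dtypes)
print(toy_df)
```

This is why the next section converts those columns to strings before cleaning them up.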


# Transforming Data Types

In the cell below, convert the GRADE DATE column from strings to a datetime type.

In [5]:
pizza_df["GRADE DATE"] = pd.to_datetime(pizza_df["GRADE DATE"])
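As a small illustration of what `pd.to_datetime` does with missing or malformed values, here is a sketch on a made-up Series. The `errors="coerce"` argument is an optional extra that is not used in the cell above; it turns unparseable strings into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical values: one valid date, one missing value, one malformed string.
dates = pd.Series(["2017-03-30", None, "not a date"])

# Missing values become NaT; with errors="coerce", bad strings do too.
parsed = pd.to_datetime(dates, errors="coerce")

print(parsed)
print(parsed.dt.year)  # the .dt accessor is available once the column is datetime
```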

The CAMIS data column currently contains values such as:

* 40363644.0
* 40363945.0

In the cell below, transform the CAMIS column to a String, and remove the trailing .0's from each string.  Your output should look like:

* 40363644
* 40363945

Hint:  you will need to call `astype()` followed by `str.replace`.

In [6]:
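# Use a raw-string regex anchored to the end of the string, and request
# regex matching explicitly (the default differs between pandas versions).
pizza_df["CAMIS"] = pizza_df["CAMIS"].astype(str).str.replace(r"\.0$", "", regex=True)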

In the cell below, transform the ZIPCODE column to a String, and remove the trailing .0's from each string.

In [7]:
# Same approach as for CAMIS: strip only a trailing ".0".
pizza_df["ZIPCODE"] = pizza_df["ZIPCODE"].astype(str).str.replace(r"\.0$", "", regex=True)
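When stripping a trailing `.0` with a regular expression, it helps to anchor the pattern to the end of the string. The sketch below uses made-up values (not taken from the pizza data) to show how an unanchored pattern can also remove a `.0` that appears in the middle of a number.

```python
import pandas as pd

# Hypothetical values: one whole number and one with a real decimal part.
ids = pd.Series([40363644.0, 10.05])
as_text = ids.astype(str)  # "40363644.0", "10.05"

# Unanchored: removes the first ".0" wherever it occurs.
unanchored = as_text.str.replace(r"\.0", "", regex=True)

# Anchored with "$": removes ".0" only at the end of the string.
anchored = as_text.str.replace(r"\.0$", "", regex=True)

print(unanchored.tolist())  # ['40363644', '105']
print(anchored.tolist())    # ['40363644', '10.05']
```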

In the cell below, output the data types again, to verify that your changes have taken effect.

In [8]:
pizza_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1151 entries, 0 to 1150
Data columns (total 10 columns):
CAMIS                  1151 non-null object
DBA                    1149 non-null object
BORO                   1147 non-null object
BUILDING               1149 non-null object
STREET                 1149 non-null object
ZIPCODE                1151 non-null object
CUISINE DESCRIPTION    1149 non-null object
SCORE                  1147 non-null float64
GRADE                  1147 non-null object
GRADE DATE             1149 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(8)
memory usage: 90.0+ KB
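Note that CAMIS and ZIPCODE now report 1151 non-null values, up from 1149. This suggests that the two missing values were converted to text by `astype(str)` rather than kept as missing. One possible alternative, sketched here on made-up data and not part of the original exercise, is to go through the nullable `Int64` and `string` dtypes, which keep missing values missing.

```python
import pandas as pd

# Hypothetical float column with one missing value.
ids = pd.Series([40363644.0, float("nan")])

# Int64 (capital I) is the nullable integer dtype; "string" is the nullable string dtype.
clean = ids.astype("Int64").astype("string")

print(clean)
print(clean.isna().tolist())
```

With this approach no `.0` suffix is ever produced, so no string replacement is needed. It assumes every non-missing value is a whole number.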


In the cell below, peek at your data frame to verify that the changes have taken effect.

In [9]:
pizza_df.head()

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,CUISINE DESCRIPTION,SCORE,GRADE,GRADE DATE
0,40363644.0,DOMINO'S,MANHATTAN,464.0,3 AVENUE,10016.0,Pizza,4.0,A,2017-03-30
1,40363945.0,DOMINO'S,MANHATTAN,148.0,WEST 72 STREET,10023.0,Pizza,12.0,A,2017-03-02
2,40364920.0,RIZZO'S FINE PIZZA,QUEENS,3013.0,STEINWAY STREET,11103.0,Pizza,12.0,A,2016-11-03
3,,,,,,,,,,NaT
4,40365280.0,COMO PIZZA,MANHATTAN,4035.0,BROADWAY,10032.0,Pizza,10.0,A,2016-08-29


# Dropping Columns

In the cell below, drop the BUILDING and STREET columns.

In [10]:
pizza_df.drop(labels=["BUILDING", "STREET"], axis="columns", inplace=True)
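An equivalent way to drop columns is the `columns=` keyword, which returns a new data frame instead of modifying the original in place. The sketch below uses a one-row toy frame whose values are for illustration only.

```python
import pandas as pd

# Toy frame for illustration.
toy_df = pd.DataFrame(
    {"DBA": ["COMO PIZZA"], "BUILDING": ["4035"], "STREET": ["BROADWAY"]}
)

# Without inplace=True, drop() leaves toy_df untouched and returns a copy.
trimmed = toy_df.drop(columns=["BUILDING", "STREET"])

print(list(toy_df.columns))   # ['DBA', 'BUILDING', 'STREET']
print(list(trimmed.columns))  # ['DBA']
```

Assigning the result to a new name makes it easy to re-run a cell without hitting a `KeyError` for columns that were already dropped.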

# Renaming Columns

In the cell below, replace every space in the column names with an underscore.

In [22]:
pizza_df.columns = pizza_df.columns.str.replace(" ", "_")

In the cell below, verify that your column name changes have taken effect.

In [21]:
pizza_df.columns

Index([u'CAMIS', u'DBA', u'BORO', u'ZIPCODE', u'CUISINE_DESCRIPTION', u'SCORE',
       u'GRADE', u'GRADE_DATE'],
      dtype='object')
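The same renaming can be written with `DataFrame.rename` and a function that is applied to each column name. This is a sketch on an empty toy frame, shown as an alternative rather than as the expected answer.

```python
import pandas as pd

# Empty toy frame; only the column names matter here.
toy_df = pd.DataFrame(columns=["CUISINE DESCRIPTION", "GRADE DATE", "SCORE"])

# Option 1: vectorized string replacement on the column index.
via_str = toy_df.columns.str.replace(" ", "_", regex=False)

# Option 2: rename() with a function applied to every column name.
renamed = toy_df.rename(columns=lambda name: name.replace(" ", "_"))

print(list(via_str))
print(list(renamed.columns))
```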

# Dealing with NA Values

Using one line of code, determine the number of NA values in each column. [You should see, for example, that BORO is missing in 4 records.]

In [12]:
pizza_df.isnull().sum()

CAMIS                  0
DBA                    2
BORO                   4
ZIPCODE                0
CUISINE_DESCRIPTION    2
SCORE                  4
GRADE                  4
GRADE_DATE             2
dtype: int64
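The one-liner works because `isnull()` returns a frame of booleans and `sum()` counts `True` as 1, column by column. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame: one missing BORO and two missing SCOREs.
toy_df = pd.DataFrame(
    {"BORO": ["QUEENS", None, "BRONX"], "SCORE": [12.0, np.nan, np.nan]}
)

na_mask = toy_df.isnull()   # True where a value is missing
na_counts = na_mask.sum()   # column-wise count of True values

print(na_counts)
```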

Using one line of code, output all rows where BORO is missing.

In [13]:
pizza_df[pizza_df.BORO.isnull()]

Unnamed: 0,CAMIS,DBA,BORO,ZIPCODE,CUISINE_DESCRIPTION,SCORE,GRADE,GRADE_DATE
3,,,,,,,,NaT
9,,,,,,,,NaT
11,40369482.0,ARMANDO'S PIZZA,,11239.0,Pizza,13.0,A,2017-03-17
12,40372631.0,YANKEE JZ PIZZA,,10472.0,Pizza,12.0,A,2016-12-20
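Selecting rows this way is called boolean indexing: the expression inside the brackets is a Series of `True`/`False` values, and only the rows marked `True` are kept. A sketch on a small toy frame, with values chosen for illustration:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "DBA": ["DOMINO'S", "ARMANDO'S PIZZA", "COMO PIZZA"],
        "BORO": ["MANHATTAN", None, "MANHATTAN"],
    }
)

mask = toy_df["BORO"].isnull()   # boolean Series, one entry per row
missing_boro = toy_df[mask]      # rows where the mask is True
present_boro = toy_df[~mask]     # "~" inverts the mask

print(missing_boro)
```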


How many rows are in your data frame?

In [23]:
pizza_df.shape[0]

1151

Using one line of code, drop all rows that contain 1 or more NAs.

In [15]:
pizza_df.dropna(how="any", inplace=True)
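`dropna` has a few options that control which rows are removed. The sketch below compares `how="any"`, `how="all"`, and the `subset=` argument on a made-up frame; only `how="any"` is used in the exercise itself.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "BORO": ["QUEENS", None, "BRONX", None],
        "SCORE": [12.0, 13.0, np.nan, np.nan],
    }
)

any_na = toy_df.dropna(how="any")           # drop rows with at least one NA
all_na = toy_df.dropna(how="all")           # drop rows where every value is NA
boro_only = toy_df.dropna(subset=["BORO"])  # only consider the BORO column

print(len(any_na), len(all_na), len(boro_only))  # 1 3 2
```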

How big is your data frame now?

In [24]:
pizza_df.shape[0]

1143

# Dealing with Duplicate Records

Does your data frame contain any duplicate records? [Answer = Yes, 1 duplicate record]

In [17]:
pizza_df.duplicated().value_counts()

False    1142
True        1
dtype: int64

Using one line of code, drop all duplicate records.

In [19]:
pizza_df.drop_duplicates(inplace=True)
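To see how `duplicated()` and `drop_duplicates()` relate, here is a sketch on a made-up frame. By default, the first occurrence of a repeated row is treated as the original and later copies are flagged as duplicates.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"CAMIS": ["1", "2", "2", "3"], "DBA": ["A", "B", "B", "C"]}
)

dup_mask = toy_df.duplicated()       # True for the second copy of row "2"/"B"
n_duplicates = int(dup_mask.sum())   # number of rows that would be dropped
deduped = toy_df.drop_duplicates()   # keeps the first occurrence of each row

print(dup_mask.tolist())  # [False, False, True, False]
print(len(deduped))       # 3
```

Note that `drop_duplicates()` keeps the original index labels, so `deduped` has index 0, 1, 3. Call `reset_index(drop=True)` afterwards if a contiguous index is needed.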