In [57]:
import json_lines
import pandas as pd
import numpy as np
import seaborn as sns
import re
import warnings
warnings.filterwarnings("ignore")
from IPython.core import display as ICD

# Introduction

The following notebook manipulates a large JSONL data file with 150,000 entries. It first breaks out the dictionaries in the name and address fields, converting the dictionary keys into new dataframe columns. It then calculates metrics for the dataframe, such as the percentage of populated fields and the unique values within columns. I began by proving all functions and metrics on a subset dataframe with 100 entries. After successfully completing a set of tasks on the small dataframe, I'll scale up to the original dataframe.


In [2]:
# Load data into list, and then load to dataframe
with json_lines.open('../data/ida_wrangling_exercise_data.2017-02-13.jsonl.gz') as f:
    data = [item for item in f]
    df = pd.DataFrame(data)
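
If the `json_lines` package is not available, the same load can be sketched with the standard library. This is a minimal sketch, assuming the file is gzip-compressed with one JSON object per line; `read_jsonl_gz` is a hypothetical helper name, not part of the original notebook.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Return a list of records from a gzip-compressed JSON Lines file."""
    records = []
    # "rt" opens the compressed stream in text mode so each line is a str
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

The returned list of dictionaries can be passed to `pd.DataFrame` in the same way as `data` above.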

### Data exploration 

In [4]:
df.head()

Unnamed: 0,address,dob,email,id,name,phone,record_date,ssn
0,"{'street': '86314 David Pass Apt. 211', 'city'...",1971-06-30,opark@hotmail.com,01d68a4c598a45559c06f4df0b3d82cb,"{'firstname': 'Cynthia', 'lastname': 'Dawson',...",624-869-4610,2006-07-08T09:02:13,xxx-xx-2412
1,"20722 Coleman Villages\nEast Rose, SC 71064-5894",1965-09-09,sperez@armstrong.com,876ff718291d4397bb1e0477ceee6ad9,"{'firstname': 'Tamara', 'lastname': 'Myers'}",1-594-462-7759,2009-03-28T20:22:57,xxx-xx-8025
2,"{'street': '6676 Young Square', 'city': 'New J...",1993-04-12,uortiz@gmail.com,81753097bf7e4e2085982f422bdb9cda,"{'firstname': 'Jamie', 'lastname': 'Alexander'}",472.218.5065x389,2016-08-30T20:31:39,xxx-xx-0568
3,"0932 Gomez Drives\nLeefort, MD 46879-3166",1977-04-14,palmerdiane@yahoo.com,2c2f7154b80f40ca80d08c5adc54ea45,"{'firstname': 'Angela', 'lastname': 'Garcia', ...",1-663-109-4460x1080,2001-02-15T18:50:35,xxx-xx-9825
4,"{'street': '158 Smith Vista', 'city': 'East Sh...",1970-03-19,nancymaxwell@gmail.com,4f5263f339694d068e17ee7fdbb852b8,"{'firstname': 'Jennifer', 'lastname': 'Rodrigu...",233-423-3823,2014-06-21T14:36:01,xxx-xx-9104


In [5]:
'Number of rows {} '.format(len(df))

'Number of rows 150000 '

This initial dataframe has 150,000 entries. Some of the fields (e.g. address and name) are recorded as dictionaries in some records and as plain strings in others. Let's start working with a subset (the first 100 entries) of the dataframe. Truncating the dataframe will speed up initial code experimentation.
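
One quick way to confirm which fields mix dictionaries and strings is to count the value types per field. The sketch below works on the raw list of records rather than the dataframe; `value_types` is a hypothetical helper, and the two sample records are abbreviated from the rows shown above.

```python
from collections import Counter


def value_types(records, field):
    """Count the Python type names seen for `field` across records."""
    return Counter(type(rec.get(field)).__name__ for rec in records)


# Abbreviated sample records, for illustration only
sample = [
    {"address": {"street": "6676 Young Square", "city": "New Julie"}},
    {"address": "20722 Coleman Villages\nEast Rose, SC 71064-5894"},
]
print(value_types(sample, "address"))
```

A field that reports both `dict` and `str` needs the kind of expansion described in the next section.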

### Operating on subset of data 

In [6]:
# Subset with first 100 entries of dataframe
df_sub = df[0:100].copy()

## 1. Make a list of all of the nested named fields that appear in any record. Concatenate nested field names using a period '.' to define named fields for nested records. Present the list in alphabetical order.

In [8]:
def expand_dict(dframe, dict_names):
    """Expand dictionary-valued columns into separate columns.

    Takes a dataframe and a list of column names whose values may be dictionaries.
    Each dictionary key becomes a new column named '<column>.<key>'.
    Returns a new dataframe with the expanded columns plus the original columns,
    in which dictionary values are replaced with NaN and string values are kept."""

    # Creates an empty dataframe
    df_nested = pd.DataFrame()

    # Iterate through the list of column names
    for item in dict_names:
        # Expand dictionary keys into columns
        df_nested = pd.concat([df_nested, dframe[item].apply(pd.Series).add_prefix(
            (item + '.'))], axis=1)
        # Drop the extra '<column>.0' column that apply creates for string values, if present
        df_nested.drop(item + '.0', axis=1, inplace=True, errors='ignore')
        # Add original nested column
        df_nested[item] = dframe[item]
        # Replace nested fields with NaN, otherwise leave string value
        df_nested[item] = df_nested[item].apply(lambda x: np.nan if isinstance(x, dict) else x)

    return df_nested

In [9]:
#Invoke the function above 
df_sub = expand_dict(df_sub, ['address', 'name'])
df_sub.head()

Unnamed: 0,address.city,address.state,address.street,address.zip,address,name.firstname,name.lastname,name.middlename,name
0,Hoodburgh,RI,86314 David Pass Apt. 211,83973.0,,Cynthia,Dawson,Claire,
1,,,,,"20722 Coleman Villages\nEast Rose, SC 71064-5894",Tamara,Myers,,
2,New Julie,UT,6676 Young Square,73125.0,,Jamie,Alexander,,
3,,,,,"0932 Gomez Drives\nLeefort, MD 46879-3166",Angela,Garcia,Alexis,
4,East Sharonstad,ME,158 Smith Vista,42483.0,,Jennifer,Rodriguez,,


In [10]:
# An alphabetical order list of the fields 
sorted(list(df_sub.columns))

['address',
 'address.city',
 'address.state',
 'address.street',
 'address.zip',
 'name',
 'name.firstname',
 'name.lastname',
 'name.middlename']
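
The same list of dotted field names can also be built directly from the raw records, without going through the dataframe. This is a minimal sketch, assuming each record is a dictionary whose `address` and `name` values are either dictionaries or strings; `nested_field_names` is a hypothetical helper, not part of the original notebook.

```python
def nested_field_names(records, fields):
    """Collect dotted names for `fields`, e.g. 'name.firstname', in sorted order."""
    names = set()
    for rec in records:
        for field in fields:
            value = rec.get(field)
            if isinstance(value, dict):
                # One dotted name per dictionary key
                names.update(f"{field}.{key}" for key in value)
            elif value is not None:
                # Plain string value: keep the parent field name itself
                names.add(field)
    return sorted(names)
```

For example, calling `nested_field_names(data, ['address', 'name'])` on the loaded records should produce a list of the same shape as the one above.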

## 2. Answer the following questions for each field in your list from question 1.

- What percentage of the records contain the field?
- What are the five most common values of the field?

#### What percentage of the records contain the field?

In [11]:
def col_percent(dframe):
    '''A function to loop through columns and print the percentage of populated items.'''
    for item in dframe:
        print('{} = {:.1f}%'.format(item, 100 *
                                    dframe[item].count() / len(dframe)))

In [12]:
# Invoke the function above 
col_percent(df_sub)

address.city = 41.0%
address.state = 41.0%
address.street = 41.0%
address.zip = 41.0%
address = 52.0%
name.firstname = 66.0%
name.lastname = 66.0%
name.middlename = 26.0%
name = 34.0%


#### What are the five most common values of the field?

In [13]:
# Five most common values in a column
def common_values(df, num):
    '''Takes a dataframe and an int num as input; displays the num most common values of each column.'''
    for item in df:
        # Prints common values as a dataframe
        ICD.display(pd.DataFrame(df[item].value_counts().head(num)) )

In [14]:
common_values(df_sub,5)

Unnamed: 0,address.city
Natalieton,1
Michaelchester,1
West Annaside,1
Chrisview,1
Hoodburgh,1


Unnamed: 0,address.state
UT,4
RI,4
WY,3
HI,2
CT,2


Unnamed: 0,address.street
119 Mullins Pines,1
280 Angela Turnpike Apt. 158,1
190 David Extension,1
561 Cynthia Cliffs Suite 453,1
01737 Hailey Drives Suite 056,1


Unnamed: 0,address.zip
19394-0990,1
73890,1
04527,1
76763-1662,1
64603,1


Unnamed: 0,address
"9524 Danielle Burg Apt. 849\nNorth Vickistad, NY 90135",1
"967 White Shoals\nGabriellechester, MN 17856-8053",1
"353 Saunders Camp Suite 140\nBryanttown, LA 05531-4786",1
"02791 Vargas Gardens\nSouth Victorborough, ND 43425",1
"PSC 9738, Box 3367\nAPO AE 85789",1


Unnamed: 0,name.firstname
Robert,3
Kevin,2
Steven,2
Brandon,2
Eric,2


Unnamed: 0,name.lastname
Davis,3
Chapman,2
Garcia,2
Hall,2
Gonzalez,2


Unnamed: 0,name.middlename
David,2
Gabrielle,1
Bryan,1
Dillon,1
Charles,1


Unnamed: 0,name
Billy Carter,1
Mr. John Brock MD,1
Eric Cain,1
Barbara Harris,1
Alisha Murray,1


## 3. How many distinct first names appear in this data set? Explain your procedure for identifying distinct first names.

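I used the unique() method on the 'name.firstname' column. Note that unique() counts NaN as a value, so the 58 entries reported below include one for missing values, leaving 57 distinct first names in this subset. With more time I would also have parsed the first names out of the strings in the 'name' column.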

In [15]:
df_sub['name.firstname'].unique()

array(['Cynthia', 'Tamara', 'Jamie', 'Angela', 'Jennifer', 'Keith',
       'Shane', 'Charles', nan, 'Amanda', 'Adrian', 'Tiffany', 'Peter',
       'Steven', 'Lindsay', 'Sara', 'Amy', 'Christopher', 'Melissa',
       'Crystal', 'Mallory', 'Elizabeth', 'Richard', 'Jonathan', 'Nicole',
       'Michael', 'Robert', 'Ryan', 'Mary', 'Brooke', 'Daniel', 'Calvin',
       'Brandon', 'Laurie', 'Ashley', 'Brent', 'Kevin', 'Eric', 'Deborah',
       'Christine', 'Joseph', 'Donald', 'Jeff', 'Nathan', 'Martin', 'Lisa',
       'Ricky', 'Casey', 'Stacy', 'Matthew', 'Cassidy', 'Jenna', 'Mark',
       'Brad', 'Stephanie', 'John', 'Lauren', 'Veronica'], dtype=object)

In [144]:
len(df_sub['name.firstname'].unique())

58
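
The count above includes `nan`; pandas' `Series.nunique()` excludes missing values by default and would avoid that. The sketch below goes one step further and also pulls first names out of free-text `name` strings such as 'Mr. John Brock MD'. It is a rough heuristic under stated assumptions: the honorific list in `PREFIXES` is a guess rather than something derived from the data, and both helper names are hypothetical.

```python
# Assumed honorific prefixes; extend after inspecting the real data
PREFIXES = {"mr", "mrs", "ms", "miss", "dr"}


def first_name_from_string(full_name):
    """Return the first token that is not an honorific, or None."""
    for token in full_name.split():
        if token.rstrip(".").lower() not in PREFIXES:
            return token
    return None


def distinct_first_names(records):
    """Collect first names from both dictionary and string 'name' values."""
    names = set()
    for rec in records:
        name = rec.get("name")
        if isinstance(name, dict):
            first = name.get("firstname")
        elif isinstance(name, str):
            first = first_name_from_string(name)
        else:
            first = None
        if first:  # skip missing or empty names
            names.add(first)
    return names
```

`len(distinct_first_names(data))` would then give a count that covers both name formats and ignores missing values.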

## 4. How many distinct street names appear in this data set? Explain your procedure for identifying distinct street names.


In [75]:
# Copy the address.street column to capture street addresses.
df_street = df_sub[['address.street']].copy()
df_street.dropna(inplace=True)
df_street.head()

Unnamed: 0,address.street
0,86314 David Pass Apt. 211
2,6676 Young Square
4,158 Smith Vista
5,461 Knapp Unions
8,3697 Mills Estates Apt. 499


In [72]:
a = df_street['address.street'][0]
re.sub(r"\d+", "", a.split('Apt', 1)[0])

' David Pass '

In [76]:
# Remove digits and everything from 'Apt' onward in the street address column ('Suite' is not handled yet)
df_street['address.street'] = df_street['address.street'].apply(lambda x: re.sub(r"\d+", "", x.split('Apt', 1)[0]))

In [77]:
df_street['address.street']

0                   David Pass 
2                  Young Square
4                   Smith Vista
5                  Knapp Unions
8                Mills Estates 
10         Hailey Drives Suite 
12                 Baker Branch
13      Bennett Motorway Suite 
14         Matthew Ports Suite 
16              Timothy Viaduct
20     Dominguez Islands Suite 
21               Matthew Courts
26                 Lloyd Canyon
27                Garcia Rapids
30                  Paul Plaza 
36           Barnes Loaf Suite 
41               Mccarthy Hill 
45                  Justin Spur
49                Miller Street
50       Osborne Village Suite 
53       Campbell Groves Suite 
54                Mullins Pines
55               Graham Meadows
56                  Solis Loop 
58                  Jason Lane 
60         Bender Common Suite 
64                    Ward Lake
67          Kenneth Trafficway 
68               Elizabeth View
69             Kimberly Plains 
71            Cook Parks Suite 
75      

In [80]:
def street_names(df, suffix_list):
    '''Unfinished draft of a function to return the distinct street names.
    It currently ignores the df argument, strips digits and any text from 'Apt' onward
    in the global df_street column, and returns suffix_list unchanged.'''
    df_street['address.street'] = df_street['address.street'].apply(lambda x: re.sub(r"\d+", "", x.split('Apt', 1)[0]))
    return [item for item in suffix_list]

In [81]:
street_names(df, ['Apt', 'Suite'])

['Apt', 'Suite']
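
One way the street-name extraction could be completed is with two regular expressions: one that removes a trailing unit designator and one that removes the leading house number. This is a sketch under assumptions, not the finished answer: it assumes units are always written as 'Apt.' or 'Suite' followed by a number at the end of the string, and the names `street_name` and `distinct_street_names` are hypothetical.

```python
import re

# Trailing unit such as " Apt. 211" or " Suite 453" (assumed formats)
UNIT_RE = re.compile(r"\s+(?:Apt\.?|Suite)\s*\S*$")
# Leading house number such as "86314 "
NUMBER_RE = re.compile(r"^\d+\s+")


def street_name(street):
    """Strip the unit designator and house number from a street address."""
    without_unit = UNIT_RE.sub("", street.strip())
    return NUMBER_RE.sub("", without_unit)


def distinct_street_names(streets):
    """Return the set of distinct street names in an iterable of addresses."""
    return {street_name(s) for s in streets if isinstance(s, str)}
```

Applied to the `address.street` values, `len(distinct_street_names(...))` gives the number of distinct street names. The string-valued `address` entries would need their first line parsed separately.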

## 5. What are the 5 most common US area codes in the phone number field? Explain your approach to identify the US area codes in this data set.
To do: remove special characters. Remove extensions (the part starting with x) from the end, then remove the last 7 digits; the last three remaining digits will be the area code.
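
The to-do steps above can be sketched as follows. This is a minimal sketch, assuming every number with at least ten digits follows the US pattern of area code plus seven-digit local number, optionally preceded by a country code; it does not check whether the resulting code is a valid US area code. The helper names `area_code` and `top_area_codes` are hypothetical.

```python
import re
from collections import Counter


def area_code(phone):
    """Return the 3-digit area code, or None if the number has too few digits."""
    main = phone.lower().split("x", 1)[0]  # drop the extension, if any
    digits = re.sub(r"\D", "", main)       # remove special characters
    if len(digits) < 10:
        return None
    # Skip the last 7 digits; the 3 digits before them are the area code
    return digits[-10:-7]


def top_area_codes(phones, n=5):
    """Return the n most common area codes as (code, count) pairs."""
    codes = (area_code(p) for p in phones if isinstance(p, str))
    return Counter(c for c in codes if c).most_common(n)
```

Calling `top_area_codes(df['phone'])` on the full dataframe would give the five most common codes under these assumptions.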