# Advent of Code 2020 Day 4: Passport Processing

## Import libraries

In [1]:
import pandas as pd
pd.set_option('display.max_rows', None)

## Part 1

### Description Part 1

You arrive at the airport only to realize that you grabbed your North Pole Credentials instead of your passport. While these documents are extremely similar, North Pole Credentials aren't issued by a country and therefore aren't actually valid documentation for travel in most of the world.


It seems like you're not the only one having problems, though; a very long line has formed for the automatic passport scanners, and the delay could upset your travel itinerary.

Due to some questionable network security, you realize you might be able to solve both of these problems at the same time.

The automatic passport scanners are slow because they're having trouble detecting which passports have all required fields. The expected fields are as follows:

byr (Birth Year)
iyr (Issue Year)
eyr (Expiration Year)
hgt (Height)
hcl (Hair Color)
ecl (Eye Color)
pid (Passport ID)
cid (Country ID)

Passport data is validated in batch files (your puzzle input). Each passport is represented as a sequence of key:value pairs separated by spaces or newlines. Passports are separated by blank lines.

Here is an example batch file containing four passports:

ecl:gry pid:860033327 eyr:2020 hcl:#fffffd
byr:1937 iyr:2017 cid:147 hgt:183cm

iyr:2013 ecl:amb cid:350 eyr:2023 pid:028048884
hcl:#cfa07d byr:1929

hcl:#ae17e1 iyr:2013
eyr:2024
ecl:brn pid:760753108 byr:1931
hgt:179cm

hcl:#cfa07d eyr:2025 pid:166559648
iyr:2011 ecl:brn hgt:59in
The first passport is valid - all eight fields are present. The second passport is invalid - it is missing hgt (the Height field).

The third passport is interesting; the only missing field is cid, so it looks like data from North Pole Credentials, not a passport at all! Surely, nobody would mind if you made the system temporarily ignore missing cid fields. Treat this "passport" as valid.

The fourth passport is missing two fields, cid and byr. Missing cid is fine, but missing any other field is not, so this passport is invalid.

According to the above rules, your improved system would report 2 valid passports.

Count the number of valid passports - those that have all required fields. Treat cid as optional. In your batch file, how many passports are valid?
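
The rule above can be cross-checked without pandas. The following is a minimal sketch rather than the method used in this notebook; `REQUIRED`, `EXAMPLE` and `count_complete` are names assumed here for illustration, and the input is the four-passport example from the puzzle text.

```python
# Required keys; 'cid' is deliberately left out because it is optional.
REQUIRED = {"byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid"}

EXAMPLE = """\
ecl:gry pid:860033327 eyr:2020 hcl:#fffffd
byr:1937 iyr:2017 cid:147 hgt:183cm

iyr:2013 ecl:amb cid:350 eyr:2023 pid:028048884
hcl:#cfa07d byr:1929

hcl:#ae17e1 iyr:2013
eyr:2024
ecl:brn pid:760753108 byr:1931
hgt:179cm

hcl:#cfa07d eyr:2025 pid:166559648
iyr:2011 ecl:brn hgt:59in
"""


def count_complete(batch: str) -> int:
    """Count passports that contain every required key."""
    count = 0
    for block in batch.strip().split("\n\n"):
        # str.split() with no argument splits on spaces and newlines alike.
        keys = {pair.split(":", 1)[0] for pair in block.split()}
        if REQUIRED <= keys:  # subset test: all required keys are present
            count += 1
    return count


print(count_complete(EXAMPLE))  # the puzzle text states 2 for this example
```

The set comparison `REQUIRED <= keys` ignores any extra keys, which is why a present or absent `cid` makes no difference.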

### Import data

In [2]:
with open('day_4_input.txt') as fp:
    passport = [p.strip() for p in fp.read().split('\n\n')]

In [3]:
# Convert all delimiters
passport = [item.replace(' ', ',') for item in passport]
passport = [item.replace('\n', ',') for item in passport]
passport = [item.split(',') for item in passport]

In [4]:
passport

[['ecl:amb',
  'pid:690616023',
  'byr:1994',
  'iyr:2014',
  'hgt:172cm',
  'hcl:#c0946f',
  'eyr:2022'],
 ['eyr:1980',
  'cid:97',
  'hcl:z',
  'ecl:#102145',
  'iyr:2011',
  'byr:1945',
  'pid:187cm',
  'hgt:179in'],
 ['ecl:amb',
  'iyr:2011',
  'cid:113',
  'eyr:2021',
  'hcl:#b6652a',
  'pid:004682943',
  'byr:1940',
  'hgt:173cm'],
 ['iyr:2023',
  'cid:146',
  'byr:2022',
  'ecl:dne',
  'hgt:76in',
  'eyr:2040',
  'hcl:z'],
 ['hcl:#f97e30',
  'cid:73',
  'iyr:2013',
  'byr:1929',
  'hgt:157cm',
  'eyr:2024',
  'ecl:blu',
  'pid:673398662'],
 ['hcl:5343fe',
  'hgt:152',
  'byr:2018',
  'eyr:1992',
  'pid:85999926',
  'iyr:1938',
  'ecl:#15bd97'],
 ['byr:1975', 'hcl:z', 'eyr:1988', 'pid:#c36f52', 'iyr:2018', 'hgt:184cm'],
 ['byr:1954',
  'eyr:2023',
  'hgt:170cm',
  'iyr:2012',
  'ecl:blu',
  'pid:299556897',
  'hcl:#b6652a'],
 ['hgt:191cm',
  'ecl:oth',
  'hcl:#7d3b0c',
  'iyr:2016',
  'pid:187567535',
  'byr:1999',
  'eyr:2023'],
 ['pid:814358147',
  'eyr:2022',
  'iyr:2000',
  '

In [5]:
passport_dict = []
for sublist in passport:
    current_dict = {}
    for key_value_pair in sublist:
        key, value = key_value_pair.split(':', 1)
        current_dict[key] = value
    passport_dict.append(current_dict)
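
The delimiter conversion and the dictionary loop can also be folded into a single step. This is a sketch under the assumption that no key or value contains whitespace; `parse_batch` and `sample` are hypothetical names, not part of the notebook.

```python
def parse_batch(text: str) -> list[dict[str, str]]:
    """Turn a batch file's text into one dict per passport."""
    return [
        # block.split() handles spaces and newlines in one go;
        # dict() consumes the resulting [key, value] pairs.
        dict(pair.split(":", 1) for pair in block.split())
        for block in text.strip().split("\n\n")
    ]


sample = "ecl:amb pid:690616023\nbyr:1994\n\neyr:1980 cid:97"
print(parse_batch(sample))
```

The result has the same shape as `passport_dict`, so it could be passed to `pd.DataFrame` in the same way.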

In [6]:
passport_dict

[{'ecl': 'amb',
  'pid': '690616023',
  'byr': '1994',
  'iyr': '2014',
  'hgt': '172cm',
  'hcl': '#c0946f',
  'eyr': '2022'},
 {'eyr': '1980',
  'cid': '97',
  'hcl': 'z',
  'ecl': '#102145',
  'iyr': '2011',
  'byr': '1945',
  'pid': '187cm',
  'hgt': '179in'},
 {'ecl': 'amb',
  'iyr': '2011',
  'cid': '113',
  'eyr': '2021',
  'hcl': '#b6652a',
  'pid': '004682943',
  'byr': '1940',
  'hgt': '173cm'},
 {'iyr': '2023',
  'cid': '146',
  'byr': '2022',
  'ecl': 'dne',
  'hgt': '76in',
  'eyr': '2040',
  'hcl': 'z'},
 {'hcl': '#f97e30',
  'cid': '73',
  'iyr': '2013',
  'byr': '1929',
  'hgt': '157cm',
  'eyr': '2024',
  'ecl': 'blu',
  'pid': '673398662'},
 {'hcl': '5343fe',
  'hgt': '152',
  'byr': '2018',
  'eyr': '1992',
  'pid': '85999926',
  'iyr': '1938',
  'ecl': '#15bd97'},
 {'byr': '1975',
  'hcl': 'z',
  'eyr': '1988',
  'pid': '#c36f52',
  'iyr': '2018',
  'hgt': '184cm'},
 {'byr': '1954',
  'eyr': '2023',
  'hgt': '170cm',
  'iyr': '2012',
  'ecl': 'blu',
  'pid': '299556

In [7]:
# Convert to df
dfa1 = pd.DataFrame(passport_dict)

In [8]:
dfa1

|   | ecl | pid | byr | iyr | hgt | hcl | eyr | cid |
|---|---|---|---|---|---|---|---|---|
| 0 | amb | 690616023 | 1994 | 2014 | 172cm | #c0946f | 2022 | NaN |
| 1 | #102145 | 187cm | 1945 | 2011 | 179in | z | 1980 | 97 |
| 2 | amb | 004682943 | 1940 | 2011 | 173cm | #b6652a | 2021 | 113 |
| 3 | dne | NaN | 2022 | 2023 | 76in | z | 2040 | 146 |
| 4 | blu | 673398662 | 1929 | 2013 | 157cm | #f97e30 | 2024 | 73 |
| 5 | #15bd97 | 85999926 | 2018 | 1938 | 152 | 5343fe | 1992 | NaN |
| 6 | NaN | #c36f52 | 1975 | 2018 | 184cm | z | 1988 | NaN |
| 7 | blu | 299556897 | 1954 | 2012 | 170cm | #b6652a | 2023 | NaN |
| 8 | oth | 187567535 | 1999 | 2016 | 191cm | #7d3b0c | 2023 | NaN |
| 9 | blu | 814358147 | 2001 | 2000 | 76in | #18171d | 2022 | NaN |


In [9]:
dfa1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253 entries, 0 to 252
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ecl     241 non-null    object
 1   pid     241 non-null    object
 2   byr     235 non-null    object
 3   iyr     236 non-null    object
 4   hgt     241 non-null    object
 5   hcl     244 non-null    object
 6   eyr     239 non-null    object
 7   cid     129 non-null    object
dtypes: object(8)
memory usage: 15.9+ KB


### Keep only passports with all required fields

In [10]:
dfa2 = dfa1.drop('cid', axis = 1)

In [11]:
dfa2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253 entries, 0 to 252
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ecl     241 non-null    object
 1   pid     241 non-null    object
 2   byr     235 non-null    object
 3   iyr     236 non-null    object
 4   hgt     241 non-null    object
 5   hcl     244 non-null    object
 6   eyr     239 non-null    object
dtypes: object(7)
memory usage: 14.0+ KB


In [12]:
dfa3 = dfa2.dropna()

In [13]:
dfa3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 170 entries, 0 to 252
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ecl     170 non-null    object
 1   pid     170 non-null    object
 2   byr     170 non-null    object
 3   iyr     170 non-null    object
 4   hgt     170 non-null    object
 5   hcl     170 non-null    object
 6   eyr     170 non-null    object
dtypes: object(7)
memory usage: 10.6+ KB
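
The same count can be obtained as a boolean mask instead of dropping columns and rows. The sketch below uses three toy records (assumed data, not the puzzle input) and the hypothetical names `records`, `required` and `complete`.

```python
import pandas as pd

# Three toy passports: complete, missing 'hgt', and missing only 'cid'.
records = [
    {"byr": "1937", "iyr": "2017", "eyr": "2020", "hgt": "183cm",
     "hcl": "#fffffd", "ecl": "gry", "pid": "860033327", "cid": "147"},
    {"byr": "1929", "iyr": "2013", "eyr": "2023",
     "hcl": "#cfa07d", "ecl": "amb", "pid": "028048884", "cid": "350"},
    {"byr": "1931", "iyr": "2013", "eyr": "2024", "hgt": "179cm",
     "hcl": "#ae17e1", "ecl": "brn", "pid": "760753108"},
]
df = pd.DataFrame(records)  # keys missing from a record become NaN

required = ["byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid"]
# True for rows where every required column is non-null; 'cid' is not inspected.
complete = df[required].notna().all(axis=1)
print(int(complete.sum()))
```

Keeping the mask around leaves the original frame untouched, which can help when checking which rows were rejected.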


## Solution: 170 passports are valid for the above criteria

## Part 2

### Description Part 2

The line is moving more quickly now, but you overhear airport security talking about how passports with invalid data are getting through. Better add some data validation, quick!

You can continue to ignore the cid field, but each other field has strict rules about what values are valid for automatic validation:

byr (Birth Year) - four digits; at least 1920 and at most 2002.
iyr (Issue Year) - four digits; at least 2010 and at most 2020.
eyr (Expiration Year) - four digits; at least 2020 and at most 2030.
hgt (Height) - a number followed by either cm or in:
If cm, the number must be at least 150 and at most 193.
If in, the number must be at least 59 and at most 76.
hcl (Hair Color) - a # followed by exactly six characters 0-9 or a-f.
ecl (Eye Color) - exactly one of: amb blu brn gry grn hzl oth.
pid (Passport ID) - a nine-digit number, including leading zeroes.
cid (Country ID) - ignored, missing or not.
Your job is to count the passports where all required fields are both present and valid according to the above rules. Here are some example values:

byr valid:   2002
byr invalid: 2003

hgt valid:   60in
hgt valid:   190cm
hgt invalid: 190in
hgt invalid: 190

hcl valid:   #123abc
hcl invalid: #123abz
hcl invalid: 123abc

ecl valid:   brn
ecl invalid: wat

pid valid:   000000001
pid invalid: 0123456789
Here are some invalid passports:

eyr:1972 cid:100
hcl:#18171d ecl:amb hgt:170 pid:186cm iyr:2018 byr:1926

iyr:2019
hcl:#602927 eyr:1967 hgt:170cm
ecl:grn pid:012533040 byr:1946

hcl:dab227 iyr:2012
ecl:brn hgt:182cm pid:021572410 eyr:2020 byr:1992 cid:277

hgt:59cm ecl:zzz
eyr:2038 hcl:74454a iyr:2023
pid:3556412378 byr:2007
Here are some valid passports:

pid:087499704 hgt:74in ecl:grn iyr:2012 eyr:2030 byr:1980
hcl:#623a2f

eyr:2029 ecl:blu cid:129 byr:1989
iyr:2014 pid:896056539 hcl:#a97842 hgt:165cm

hcl:#888785
hgt:164cm byr:2001 iyr:2015 cid:88
pid:545766238 ecl:hzl
eyr:2022

iyr:2010 hgt:158cm hcl:#b6652a ecl:blu byr:1944 eyr:2021 pid:093154719
Count the number of valid passports - those that have all required fields and valid values. Continue to treat cid as optional. In your batch file, how many passports are valid?
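
The field rules above translate fairly directly into one small check per key. This is a pandas-free sketch for cross-checking, not the approach taken in the cells that follow; `year_ok`, `height_ok`, `VALIDATORS`, `is_valid` and `parse` are names assumed for illustration.

```python
import re

EYE_COLORS = {"amb", "blu", "brn", "gry", "grn", "hzl", "oth"}


def year_ok(value: str, low: int, high: int) -> bool:
    """Exactly four digits and within [low, high]."""
    return re.fullmatch(r"[0-9]{4}", value) is not None and low <= int(value) <= high


def height_ok(value: str) -> bool:
    """A number followed by 'cm' or 'in', within the range for that unit."""
    match = re.fullmatch(r"([0-9]+)(cm|in)", value)
    if match is None:
        return False
    number, unit = int(match.group(1)), match.group(2)
    return 150 <= number <= 193 if unit == "cm" else 59 <= number <= 76


# One validator per required key; 'cid' is intentionally absent.
VALIDATORS = {
    "byr": lambda v: year_ok(v, 1920, 2002),
    "iyr": lambda v: year_ok(v, 2010, 2020),
    "eyr": lambda v: year_ok(v, 2020, 2030),
    "hgt": height_ok,
    "hcl": lambda v: re.fullmatch(r"#[0-9a-f]{6}", v) is not None,
    "ecl": lambda v: v in EYE_COLORS,
    "pid": lambda v: re.fullmatch(r"[0-9]{9}", v) is not None,
}


def is_valid(passport: dict) -> bool:
    """All required keys are present and each value passes its check."""
    return all(key in passport and check(passport[key])
               for key, check in VALIDATORS.items())


def parse(text: str) -> dict:
    """Parse one passport given as whitespace-separated key:value pairs."""
    return dict(pair.split(":", 1) for pair in text.split())


good = parse("pid:087499704 hgt:74in ecl:grn iyr:2012 eyr:2030 byr:1980 hcl:#623a2f")
bad = parse("eyr:1972 cid:100 hcl:#18171d ecl:amb hgt:170 pid:186cm iyr:2018 byr:1926")
print(is_valid(good), is_valid(bad))
```

`re.fullmatch` is used throughout so that a value has to match the pattern from start to end, not merely contain it.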

In [14]:
# Treat value types
# Work on an explicit copy so the column assignments below do not raise SettingWithCopyWarning
dfa3 = dfa3.copy()
dfa3['byr'] = dfa3['byr'].astype('Int64')
dfa3['iyr'] = dfa3['iyr'].astype('Int64')
dfa3['eyr'] = dfa3['eyr'].astype('Int64')


In [15]:
dfa3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 170 entries, 0 to 252
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ecl     170 non-null    object
 1   pid     170 non-null    object
 2   byr     170 non-null    Int64 
 3   iyr     170 non-null    Int64 
 4   hgt     170 non-null    object
 5   hcl     170 non-null    object
 6   eyr     170 non-null    Int64 
dtypes: Int64(3), object(4)
memory usage: 11.1+ KB


### Condition 1: 'byr'

In [16]:
# First condition: byr (Birth Year) - four digits; at least 1920 and at most 2002
dfa4 = dfa3[(dfa3['byr'] >= 1920) & (dfa3['byr'] <= 2002)].reset_index(drop = True)

In [17]:
dfa4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136 entries, 0 to 135
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ecl     136 non-null    object
 1   pid     136 non-null    object
 2   byr     136 non-null    Int64 
 3   iyr     136 non-null    Int64 
 4   hgt     136 non-null    object
 5   hcl     136 non-null    object
 6   eyr     136 non-null    Int64 
dtypes: Int64(3), object(4)
memory usage: 8.0+ KB
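
A range comparison on integers does not by itself enforce the "four digits" part of the rule, since a string such as `02002` converts to 2002. The sketch below shows one way to check both parts on a toy Series; `years`, `four_digits` and `in_range` are assumed names and the values are made up for illustration.

```python
import pandas as pd

years = pd.Series(["2002", "2003", "1920", "02002"])

# Shape check on the raw strings: exactly four digits.
four_digits = years.str.fullmatch(r"[0-9]{4}")
# Range check on the numeric values; between() is inclusive on both ends.
in_range = pd.to_numeric(years).between(1920, 2002)

print((four_digits & in_range).tolist())
```

The same pattern applies to `iyr` and `eyr` with their own bounds.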


### Condition 2: 'iyr'

In [18]:
# Second condition: iyr (Issue Year) - four digits; at least 2010 and at most 2020
dfa5 = dfa4[(dfa4['iyr'] >= 2010) & (dfa4['iyr'] <= 2020)]

In [19]:
dfa5.info()

<class 'pandas.core.frame.DataFrame'>
Index: 122 entries, 0 to 135
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ecl     122 non-null    object
 1   pid     122 non-null    object
 2   byr     122 non-null    Int64 
 3   iyr     122 non-null    Int64 
 4   hgt     122 non-null    object
 5   hcl     122 non-null    object
 6   eyr     122 non-null    Int64 
dtypes: Int64(3), object(4)
memory usage: 8.0+ KB


### Condition 3: 'eyr'

In [20]:
# Third condition: eyr (Expiration Year) - four digits; at least 2020 and at most 2030
dfa6 = dfa5[(dfa5['eyr'] >= 2020) & (dfa5['eyr'] <= 2030)]

In [21]:
dfa6.info()

<class 'pandas.core.frame.DataFrame'>
Index: 115 entries, 0 to 135
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ecl     115 non-null    object
 1   pid     115 non-null    object
 2   byr     115 non-null    Int64 
 3   iyr     115 non-null    Int64 
 4   hgt     115 non-null    object
 5   hcl     115 non-null    object
 6   eyr     115 non-null    Int64 
dtypes: Int64(3), object(4)
memory usage: 7.5+ KB


### Condition 4: 'hgt'

In [22]:
# Fourth condition: hgt (Height) - a number followed by either cm or in:
# If cm, the number must be at least 150 and at most 193
# If in, the number must be at least 59 and at most 76

# Keep only rows whose height is digits followed by 'cm' or 'in';
# .copy() gives an independent frame so the assignments below do not warn
dfa7 = dfa6[dfa6['hgt'].str.fullmatch(r'\d+(?:cm|in)')].copy()

In [23]:
# Create unit column from the two-character suffix
dfa7['hgt_unit'] = dfa7['hgt'].str[-2:]

In [24]:
# Strip the two-character unit from the 'hgt' column
dfa7['hgt'] = dfa7['hgt'].str[:-2]

In [25]:
dfa7['hgt'] = dfa7['hgt'].astype('Int64')


In [26]:
dfa7.info()

<class 'pandas.core.frame.DataFrame'>
Index: 114 entries, 0 to 135
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ecl       114 non-null    object
 1   pid       114 non-null    object
 2   byr       114 non-null    Int64 
 3   iyr       114 non-null    Int64 
 4   hgt       114 non-null    Int64 
 5   hcl       114 non-null    object
 6   eyr       114 non-null    Int64 
 7   hgt_unit  114 non-null    object
dtypes: Int64(4), object(4)
memory usage: 8.5+ KB


In [27]:
# Fourth condition: hgt (Height) - a number followed by either cm or in:
# If cm, the number must be at least 150 and at most 193
# If in, the number must be at least 59 and at most 76
dfa8 = dfa7[((dfa7['hgt_unit'] == 'cm') & (dfa7['hgt'] >= 150) & (dfa7['hgt'] <= 193)) | 
        ((dfa7['hgt_unit'] == 'in') & (dfa7['hgt'] >= 59) & (dfa7['hgt'] <= 76))]

In [28]:
dfa8.info()

<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 135
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ecl       111 non-null    object
 1   pid       111 non-null    object
 2   byr       111 non-null    Int64 
 3   iyr       111 non-null    Int64 
 4   hgt       111 non-null    Int64 
 5   hcl       111 non-null    object
 6   eyr       111 non-null    Int64 
 7   hgt_unit  111 non-null    object
dtypes: Int64(4), object(4)
memory usage: 8.2+ KB
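
The number and the unit can also be pulled apart in a single step with `Series.str.extract`. The following is a sketch on the example values from the puzzle text, not a replacement for the cells above; `heights`, `parts`, `number`, `unit` and `valid` are names assumed here.

```python
import pandas as pd

heights = pd.Series(["60in", "190cm", "190in", "190"])

# One column per capture group; rows that do not match become NaN in both columns.
parts = heights.str.extract(r"^(\d+)(cm|in)$")
number = pd.to_numeric(parts[0])
unit = parts[1]

# Apply the range that belongs to each unit; NaN rows fail both branches.
valid = ((unit == "cm") & number.between(150, 193)) | \
        ((unit == "in") & number.between(59, 76))
print(valid.tolist())
```

Because the pattern is anchored with `^` and `$`, a value without a unit, such as `190`, is rejected by the extraction itself.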


### Condition 5: 'hcl'

In [29]:
# Fifth condition: hcl (Hair Color) - a # followed by exactly six characters 0-9 or a-f
# fullmatch requires the whole value to match; only lowercase a-f is allowed by the rule
dfa9 = dfa8[dfa8['hcl'].str.fullmatch(r'#[0-9a-f]{6}')]

In [30]:
dfa9.info()

<class 'pandas.core.frame.DataFrame'>
Index: 109 entries, 0 to 135
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ecl       109 non-null    object
 1   pid       109 non-null    object
 2   byr       109 non-null    Int64 
 3   iyr       109 non-null    Int64 
 4   hgt       109 non-null    Int64 
 5   hcl       109 non-null    object
 6   eyr       109 non-null    Int64 
 7   hgt_unit  109 non-null    object
dtypes: Int64(4), object(4)
memory usage: 8.1+ KB
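
For "exactly six characters", it matters whether the pattern is searched for anywhere in the value or has to cover the whole value. The sketch below contrasts the two on made-up values; `colors`, `loose` and `strict` are assumed names.

```python
import pandas as pd

colors = pd.Series(["#123abc", "#123abz", "123abc", "#123abcd", "#123ABC"])

# Substring search: succeeds if the pattern occurs anywhere in the value.
loose = colors.str.contains(r"#[0-9a-fA-F]{6}")
# Whole-value match restricted to lowercase hex digits, as the rule states.
strict = colors.str.fullmatch(r"#[0-9a-f]{6}")

print(loose.tolist())
print(strict.tolist())
```

The seven-character `#123abcd` and the uppercase `#123ABC` pass the substring search but fail the whole-value match.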


### Condition 6: 'ecl'

In [31]:
# Sixth condition: ecl (Eye Color) - exactly one of: amb blu brn gry grn hzl oth
# isin tests for exact membership, so values that merely contain a valid code are rejected
dfa10 = dfa9[dfa9['ecl'].isin(['amb', 'blu', 'brn', 'gry', 'grn', 'hzl', 'oth'])]

In [32]:
dfa10.info()

<class 'pandas.core.frame.DataFrame'>
Index: 107 entries, 0 to 135
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ecl       107 non-null    object
 1   pid       107 non-null    object
 2   byr       107 non-null    Int64 
 3   iyr       107 non-null    Int64 
 4   hgt       107 non-null    Int64 
 5   hcl       107 non-null    object
 6   eyr       107 non-null    Int64 
 7   hgt_unit  107 non-null    object
dtypes: Int64(4), object(4)
memory usage: 7.9+ KB
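
"Exactly one of" is a membership test, which `Series.isin` expresses directly. The sketch below compares it with a substring search on made-up values; `eyes`, `allowed`, `loose` and `strict` are assumed names.

```python
import pandas as pd

eyes = pd.Series(["brn", "wat", "brnx", "amber"])
allowed = ["amb", "blu", "brn", "gry", "grn", "hzl", "oth"]

# Substring search: any value containing one of the codes passes.
loose = eyes.str.contains("|".join(allowed))
# Exact membership: the value must equal one of the codes.
strict = eyes.isin(allowed)

print(loose.tolist())
print(strict.tolist())
```

Values such as `brnx` or `amber` contain a valid code and so pass the substring search, while `isin` rejects them.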


### Condition 7: 'pid'

In [33]:
# Seventh condition: pid (Passport ID) - a nine-digit number, including leading zeroes
dfa11 = dfa10[dfa10['pid'].str.fullmatch(r'[0-9]{9}')]

In [34]:
dfa11.info()

<class 'pandas.core.frame.DataFrame'>
Index: 103 entries, 0 to 135
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ecl       103 non-null    object
 1   pid       103 non-null    object
 2   byr       103 non-null    Int64 
 3   iyr       103 non-null    Int64 
 4   hgt       103 non-null    Int64 
 5   hcl       103 non-null    object
 6   eyr       103 non-null    Int64 
 7   hgt_unit  103 non-null    object
dtypes: Int64(4), object(4)
memory usage: 7.6+ KB


## Solution: 103 passports are valid for the above criteria