# Remove fields based on ID column

## Description

- Use when only a portion of the dataset contained in a template should be kept
- Can also be used for cleaning large datasets
- Filters on the string contents of a categorical ID column to keep only the fields that belong to the desired subset of the template data
- Useful when a template contains subsets of data that are not needed in the final dataset
- Can be incorporated into a cleaning function when iterating through files in a directory
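
The last point above can be sketched as a small helper. This is only a minimal sketch under stated assumptions: the function name `filter_ids_in_directory`, its parameters, and the `cleaned_` output prefix are hypothetical and not part of the original workflow, and the ID column is assumed to hold strings.

```python
from pathlib import Path

import pandas as pd


def filter_ids_in_directory(in_dir, out_dir, id_col, pattern):
    """Keep only rows whose id_col matches pattern, for every CSV in in_dir.

    Hypothetical helper: the name, arguments, and output naming are
    illustrative assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(Path(in_dir).glob('*.csv')):
        df = pd.read_csv(csv_path)
        # na=False treats missing IDs as non-matches so the mask stays boolean
        mask = df[id_col].str.contains(pattern, na=False)
        out_path = out_dir / f'cleaned_{csv_path.name}'
        df[mask].to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

A call such as `filter_ids_in_directory('raw', 'cleaned', 'id1', 'P')` would write one filtered copy per input file; the directory names here are placeholders.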

## Import libraries

In [1]:
import pandas as pd
import numpy as np

## Import data

In [2]:
dfa1 = pd.read_csv('remove_fields_id_raw.csv')

In [3]:
dfa1

Unnamed: 0,id1,id2,id3,number,letter
0,K1,KI1,K1,1,a
1,K2,KI2,IK2,2,b
2,K3,KI3,KSI3,3,c
3,P1,PQ1,P1,4,d
4,P2,PQ2,QP2,5,e
5,P3,PQ3,PSQ3,6,f
6,L1,LM1,L1,7,g
7,L2,LM2,ML2,8,h
8,L3,LM3,LSM3,9,i


## Create a new dataframe only including the desired subset of data based on ID

### If IDs can be distinguished by the presence of a certain letter/number:

In [4]:
dfa2 = dfa1[dfa1['id1'].str.contains('P') == True]

In [5]:
dfa2

Unnamed: 0,id1,id2,id3,number,letter
3,P1,PQ1,P1,4,d
4,P2,PQ2,QP2,5,e
5,P3,PQ3,PSQ3,6,f


### If IDs can be distinguished by the presence of a combination of letters/numbers, always in same order:

In [6]:
dfa3 = dfa1[dfa1['id2'].str.contains('PQ') == True]

In [7]:
dfa3

Unnamed: 0,id1,id2,id3,number,letter
3,P1,PQ1,P1,4,d
4,P2,PQ2,QP2,5,e
5,P3,PQ3,PSQ3,6,f


### If IDs can be distinguished by the presence of one letter OR another, NOT always in same order:

In [8]:
# This searches the strings using regex (| = OR)
dfa4 = dfa1[dfa1['id3'].str.contains('P|Q') == True]

In [9]:
dfa4

Unnamed: 0,id1,id2,id3,number,letter
3,P1,PQ1,P1,4,d
4,P2,PQ2,QP2,5,e
5,P3,PQ3,PSQ3,6,f
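
When there are more than a couple of alternatives, the OR pattern can be built from a list instead of typed by hand. The sketch below assumes a hypothetical list named `keep` and uses a small inline Series as a stand-in for the `id3` column of the example data.

```python
import re

import pandas as pd

# Stand-in for the id3 column of the example data
ids = pd.Series(['K1', 'IK2', 'KSI3', 'P1', 'QP2', 'PSQ3', 'L1', 'ML2', 'LSM3'])

# Hypothetical list of substrings to keep
keep = ['P', 'Q']

# re.escape makes each entry literal before joining with the regex OR operator
pattern = '|'.join(re.escape(s) for s in keep)

mask = ids.str.contains(pattern)
print(ids[mask].tolist())  # ['P1', 'QP2', 'PSQ3']
```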


### To keep all fields *except* the ones that contain either of those letters:

In [10]:
# ~ inverts the boolean mask, keeping only rows whose id3 contains neither P nor Q
dfa5 = dfa1[~dfa1['id3'].str.contains('P|Q')]

In [11]:
dfa5

Unnamed: 0,id1,id2,id3,number,letter
0,K1,KI1,K1,1,a
1,K2,KI2,IK2,2,b
2,K3,KI3,KSI3,3,c
6,L1,LM1,L1,7,g
7,L2,LM2,ML2,8,h
8,L3,LM3,LSM3,9,i


### If IDs can be distinguished by the presence of one letter AND another, NOT always in same order:

In [12]:
# This searches the strings using regex (| = OR, .* indicates there may be characters between those listed)
dfa6 = dfa1[dfa1['id3'].str.contains('P.*Q|Q.*P') == True]

In [13]:
dfa6

Unnamed: 0,id1,id2,id3,number,letter
4,P2,PQ2,QP2,5,e
5,P3,PQ3,PSQ3,6,f
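
An alternative to the order-independent regex is to build one mask per letter and combine them with `&`, which may be easier to read as the number of required letters grows. This is a sketch using an inline stand-in for the `id3` column; the variable names are illustrative.

```python
import pandas as pd

# Stand-in for the id3 column of the example data
ids = pd.Series(['K1', 'IK2', 'KSI3', 'P1', 'QP2', 'PSQ3', 'L1', 'ML2', 'LSM3'])

# One boolean mask per required letter (regex=False: plain substring search)
has_p = ids.str.contains('P', regex=False)
has_q = ids.str.contains('Q', regex=False)

# & keeps only the rows where both masks are True, regardless of letter order
both = ids[has_p & has_q]
print(both.tolist())  # ['QP2', 'PSQ3']
```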


## Notes for robust data filtering

- Use `case=False` to ignore upper vs. lower case in the search
- If nulls are present, `str.contains` returns a null rather than True/False for those rows, which can break the filter (notably when negating with `~`); use `na=False` to treat nulls as non-matches
- Other regex syntax can be added to extend the filter
- When searching with regex, the following characters have a special meaning: `. ^ $ * + ? { } [ ] \ | ( )`. Wrap the search string in `re.escape()` to have them interpreted literally
- Alternatively, use `regex=False` if regex is not needed at all and the characters above should be treated as literal characters to search for within the string
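
The options above can be illustrated with a short sketch. The IDs below are made up for the demonstration (a lowercase value, a missing value, and an ID containing a literal dot) and are not part of the example dataset.

```python
import re

import pandas as pd

# Hypothetical IDs: mixed case, one missing value, one literal dot
ids = pd.Series(['P1', 'p2', None, 'K.3', 'KX3'])

# case=False matches 'P1' and 'p2'; na=False counts the missing value as no match
print(ids[ids.str.contains('p', case=False, na=False)].tolist())  # ['P1', 'p2']

# An unescaped '.' is a regex wildcard, so 'K.3' also matches 'KX3'
print(ids[ids.str.contains('K.3', na=False)].tolist())  # ['K.3', 'KX3']

# re.escape forces the dot to be matched literally
print(ids[ids.str.contains(re.escape('K.3'), na=False)].tolist())  # ['K.3']

# regex=False gives the same literal match without any escaping
print(ids[ids.str.contains('K.3', regex=False, na=False)].tolist())  # ['K.3']
```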

## Export data

In [14]:
dfa2.to_csv('cleaned_remove_fields_id_P_only.csv', encoding = 'utf-8', index = False, header = True)

dfa3.to_csv('cleaned_remove_fields_id_PQ_only.csv', encoding = 'utf-8', index = False, header = True)

dfa4.to_csv('cleaned_remove_fields_id_P_or_Q_only.csv', encoding = 'utf-8', index = False, header = True)

dfa5.to_csv('cleaned_remove_fields_id_not_P_or_Q.csv', encoding = 'utf-8', index = False, header = True)

dfa6.to_csv('cleaned_remove_fields_id_P_and_Q.csv', encoding = 'utf-8', index = False, header = True)