# Idea for building a dataframe

In [5]:
# Import relevant modules
import numpy as np
import pandas as pd

## Approach 1

In [6]:
# Here's my list of requirement sets together with their requirement subsets
r = [('1',), ('2', 'a'), ('2', 'b'), ('2', 'c'), ('3', 'a', 'b')]

In [7]:
# Try calling pd.DataFrame on it and see the results
pd.DataFrame(r)

Unnamed: 0,0,1,2
0,1,,
1,2,a,
2,2,b,
3,2,c,
4,3,a,b


Note 1: At first glance, this seems to be what we want. However, after careful consideration, we really want columns such as `FILE_NAME` and `JOB_CLASS_TITLE` to come first. At this point I did not see an easy way to rearrange the columns in the style of SQL's *SELECT [column 1, column 2] FROM*, so I set this approach aside.
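For what it's worth, pandas can select and reorder columns by indexing with a list of column names, which is roughly the counterpart of a SQL column list. The sketch below is only an illustration: the names `SUBSET_1` and `SUBSET_2` and the placeholder value `'x'` are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

r = [('1',), ('2', 'a'), ('2', 'b'), ('2', 'c'), ('3', 'a', 'b')]
df = pd.DataFrame(r)

# Rename the positional columns 0, 1, 2 (hypothetical names for illustration)
df.columns = ['REQUIREMENT_SET_ID', 'SUBSET_1', 'SUBSET_2']

# Add a new column; a scalar is repeated for every row
df['FILE_NAME'] = 'x'

# Reorder by indexing with a list of column names
df = df[['FILE_NAME', 'REQUIREMENT_SET_ID', 'SUBSET_1', 'SUBSET_2']]
print(df.columns.tolist())
```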

## Approach 2

In [8]:
# Let's try creating only one column
pd.DataFrame(index=5)

TypeError: Index(...) must be called with a collection of some kind, 5 was passed

In [9]:
# TypeError: Index(...) must be called with a collection of some kind, 5 was passed
# Let's do it again
pd.DataFrame(index=[5, 7])

5
7


In [10]:
# Looks good. Let's do it again
pd.DataFrame(index=range(5), columns=['FILE_NAME'])

Unnamed: 0,FILE_NAME
0,
1,
2,
3,
4,


In [11]:
# Much better! Can it automatically populate?
pd.DataFrame({'col1': [2,3], 'col2': ['x']})

ValueError: arrays must all be same length

In [12]:
# ValueError: arrays must all be same length
# OK, we can fix this with the list multiplication trick
pd.DataFrame({'col1': [3, 5], 'col2': ['x']*2})

Unnamed: 0,col1,col2
0,3,x
1,5,x
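As a side note, the multiplication trick is not the only option. When at least one column is a list, pandas repeats a plain scalar to match its length. A minimal sketch, reusing the toy column names from the cell above:

```python
import pandas as pd

# 'x' is a scalar, so pandas repeats it to match the length of col1
df = pd.DataFrame({'col1': [3, 5], 'col2': 'x'})
print(df)
```

This only works when some other column (or an explicit `index=`) fixes the number of rows; a one-element list such as `['x']` is not repeated, which is why In [11] failed.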


## Approach 3

In [13]:
# Let's build a dictionary of field names
k = {'FILE_NAME': None,
     'JOB_CLASS_TITLE': None
    }
k

{'FILE_NAME': None, 'JOB_CLASS_TITLE': None}

In [14]:
# Then fill this dictionary with the multiplication trick we just learned
k['FILE_NAME'] = ['x']*5
k

{'FILE_NAME': ['x', 'x', 'x', 'x', 'x'], 'JOB_CLASS_TITLE': None}

In [15]:
# Finally, let's convert this into a dataframe
pd.DataFrame(k)

Unnamed: 0,FILE_NAME,JOB_CLASS_TITLE
0,x,
1,x,
2,x,
3,x,
4,x,


In [16]:
for key, values in k.items():
    print(key, values)

FILE_NAME ['x', 'x', 'x', 'x', 'x']
JOB_CLASS_TITLE None


## Conclusion
Build a giant dictionary and hopefully, things will work.

In [17]:
field_name_dict = {'FILE_NAME': None, 
                   'JOB_CLASS_TITLE': None, 
                   'JOB_CLASS_NO': None,
                   'REQUIREMENT_SET_ID': None,
                   'REQUIREMENT_SUBSET_ID': None, 
                   'JOB_DUTIES': None,
                   'EDUCATION_YEARS': None,
                   'SCHOOL_TYPE': None,
                   'EDUCATION_MAJOR': None,
                   'EXPERIENCE_LENGTH': None,
                   'FULL_TIME_PART_TIME': None,
                   'EXP_JOB_CLASS_TITLE': None,
                   'EXP_JOB_CLASS_ALT_RESP': None,
                   'EXP_JOB_CLASS_FUNCTION': None,
                   'COURSE_COUNT': None,
                   'COURSE_LENGTH': None,
                   'COURSE_SUBJECT': None,
                   'MISC_COURSE_DETAILS': None,
                   'DRIVERS_LICENSE_REQ': None,
                   'DRIV_LIC_TYPE': None,
                   'ADDTL_LIC': None,
                   'EXAM_TYPE': None,
                   'ENTRY_SALARY_GEN': None,
                   'ENTRY_SALARY_DWP': None,
                   'OPEN_DATE': None
                  }
# Sanity check
print(len(field_name_dict)) # should be 25

25
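A shorter way to build the same kind of dictionary is `dict.fromkeys`, which maps every name to `None` by default. This is only a sketch with the first few field names; the full list would go in `field_names`, which is a name I am introducing here for illustration.

```python
# Only a subset of the 25 field names, for illustration
field_names = ['FILE_NAME', 'JOB_CLASS_TITLE', 'JOB_CLASS_NO',
               'REQUIREMENT_SET_ID', 'REQUIREMENT_SUBSET_ID']

# Every key maps to None, in the order given
field_name_dict = dict.fromkeys(field_names)

# Sanity check: no accidental duplicate names
assert len(field_name_dict) == len(field_names)
print(field_name_dict)
```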


In [26]:
pd.DataFrame(field_name_dict)

ValueError: If using all scalar values, you must pass an index

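Note: The error message, "ValueError: If using all scalar values, you must pass an index", means that pandas cannot tell how many rows to create when every value is a scalar. We have to either pass the `index=` argument or wrap each value in a list, e.g. `[None]`.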
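Both fixes can be sketched on a two-field version of the dictionary. The variable names below are mine; the third variant, an empty frame that only carries the column names, is an extra option and not something the error message suggests.

```python
import pandas as pd

fields = {'FILE_NAME': None, 'JOB_CLASS_TITLE': None}

# Fix 1: pass an index so pandas knows how many rows to create
one_row = pd.DataFrame(fields, index=[0])

# Fix 2: wrap every scalar in a one-element list
also_one_row = pd.DataFrame({key: [value] for key, value in fields.items()})

# Extra option: no rows at all, just the column names
empty = pd.DataFrame(columns=list(fields))

print(one_row.shape, also_one_row.shape, empty.shape)
```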

In [28]:
# Fix the above error, passing in index=range(10)
df = pd.DataFrame(field_name_dict, index=range(10))
df

Unnamed: 0,FILE_NAME,JOB_CLASS_TITLE,JOB_CLASS_NO,REQUIREMENT_SET_ID,REQUIREMENT_SUBSET_ID,JOB_DUTIES,EDUCATION_YEARS,SCHOOL_TYPE,EDUCATION_MAJOR,EXPERIENCE_LENGTH,...,COURSE_LENGTH,COURSE_SUBJECT,MISC_COURSE_DETAILS,DRIVERS_LICENSE_REQ,DRIV_LIC_TYPE,ADDTL_LIC,EXAM_TYPE,ENTRY_SALARY_GEN,ENTRY_SALARY_DWP,OPEN_DATE
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


In [29]:
# Drop rows with missing values. Here every row is all None, so nothing is left
df.dropna()

Unnamed: 0,FILE_NAME,JOB_CLASS_TITLE,JOB_CLASS_NO,REQUIREMENT_SET_ID,REQUIREMENT_SUBSET_ID,JOB_DUTIES,EDUCATION_YEARS,SCHOOL_TYPE,EDUCATION_MAJOR,EXPERIENCE_LENGTH,...,COURSE_LENGTH,COURSE_SUBJECT,MISC_COURSE_DETAILS,DRIVERS_LICENSE_REQ,DRIV_LIC_TYPE,ADDTL_LIC,EXAM_TYPE,ENTRY_SALARY_GEN,ENTRY_SALARY_DWP,OPEN_DATE
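A side note on the `dropna` call above: by default it drops every row that has at least one missing value, which is not the same as dropping duplicates. The sketch below compares it with `dropna(how='all')` and `drop_duplicates()` on a small toy frame (the column names `A` and `B` are placeholders).

```python
import pandas as pd

# Three rows of None, then one real value in the first row
df = pd.DataFrame({'A': None, 'B': None}, index=range(3))
df.loc[0, 'A'] = 'x'

any_missing = df.dropna()            # drops rows with at least one missing value
all_missing = df.dropna(how='all')   # drops only rows that are entirely missing
deduped = df.drop_duplicates()       # keeps the first of each set of identical rows

print(len(any_missing), len(all_missing), len(deduped))
```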


Building the dataframe first is actually not what I intended. I planned to fill in the dictionary before converting it to a dataframe. However, let's continue with this approach and see what happens.

In [32]:
# Fill JOB_CLASS_TITLE with 'Systems Analyst' five times.
df['JOB_CLASS_TITLE'] = ['Systems Analyst']*5

ValueError: Length of values does not match length of index

Oops! This doesn't work: when assigning a list to a column, we have to supply a value for every row.
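If only some rows should be filled, one option is to assign through `.loc` with an explicit row selection instead of assigning a whole list. A minimal sketch on a two-column version of the frame:

```python
import pandas as pd

df = pd.DataFrame({'FILE_NAME': None, 'JOB_CLASS_TITLE': None}, index=range(10))

# Fill only the first five rows; the scalar is repeated for the selected rows
df.loc[df.index[:5], 'JOB_CLASS_TITLE'] = 'Systems Analyst'

print(df['JOB_CLASS_TITLE'].notna().sum())
```

The remaining five rows stay missing, so this does not remove the need to know the final number of rows up front.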

In [34]:
df['JOB_CLASS_TITLE'] = ['Systems Analyst']*10
df

Unnamed: 0,FILE_NAME,JOB_CLASS_TITLE,JOB_CLASS_NO,REQUIREMENT_SET_ID,REQUIREMENT_SUBSET_ID,JOB_DUTIES,EDUCATION_YEARS,SCHOOL_TYPE,EDUCATION_MAJOR,EXPERIENCE_LENGTH,...,COURSE_LENGTH,COURSE_SUBJECT,MISC_COURSE_DETAILS,DRIVERS_LICENSE_REQ,DRIV_LIC_TYPE,ADDTL_LIC,EXAM_TYPE,ENTRY_SALARY_GEN,ENTRY_SALARY_DWP,OPEN_DATE
0,,Systems Analyst,,,,,,,,,...,,,,,,,,,,
1,,Systems Analyst,,,,,,,,,...,,,,,,,,,,
2,,Systems Analyst,,,,,,,,,...,,,,,,,,,,
3,,Systems Analyst,,,,,,,,,...,,,,,,,,,,
4,,Systems Analyst,,,,,,,,,...,,,,,,,,,,
5,,Systems Analyst,,,,,,,,,...,,,,,,,,,,
6,,Systems Analyst,,,,,,,,,...,,,,,,,,,,
7,,Systems Analyst,,,,,,,,,...,,,,,,,,,,
8,,Systems Analyst,,,,,,,,,...,,,,,,,,,,
9,,Systems Analyst,,,,,,,,,...,,,,,,,,,,


But then I'll get stuck at `REQUIREMENT_SET_ID`. It seems that I have to calculate the number of `REQUIREMENT_SET_ID` values first, before doing anything else, because that number determines how many rows a job needs. This means that I have to fill in my dictionary with values before converting it to a dataframe. To this end, let's go to the `Objective1_c` notebook to define a **base** function that calculates the number of rows needed for a job.
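To make the "fill the dictionary first" idea concrete, here is a rough sketch using the requirement list `r` from Approach 1. It is not the base function from `Objective1_c`. The helper `expand_requirements` is hypothetical, and it assumes that each tuple is a requirement set ID followed by its subset IDs and that every (set, subset) pair becomes one row.

```python
import pandas as pd

r = [('1',), ('2', 'a'), ('2', 'b'), ('2', 'c'), ('3', 'a', 'b')]

def expand_requirements(reqs):
    """Hypothetical helper: return one (set_id, subset_id) pair per row."""
    pairs = []
    for set_id, *subset_ids in reqs:
        # A set without subsets still needs one row (assumption)
        for subset_id in subset_ids or [None]:
            pairs.append((set_id, subset_id))
    return pairs

pairs = expand_requirements(r)
n_rows = len(pairs)  # the row count has to be known before filling anything

# Fill the dictionary first, then convert it to a dataframe
data = {
    'FILE_NAME': ['x'] * n_rows,  # placeholder value, as in Approach 3
    'JOB_CLASS_TITLE': ['Systems Analyst'] * n_rows,
    'REQUIREMENT_SET_ID': [set_id for set_id, _ in pairs],
    'REQUIREMENT_SUBSET_ID': [subset_id for _, subset_id in pairs],
}
df = pd.DataFrame(data)
print(df)
```

Because every list has length `n_rows`, the "arrays must all be same length" error from Approach 2 cannot occur, and the columns come out in the order the keys were inserted.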