# Making CSV Files from Individual JSON Files

## First import the libraries needed

In [1]:
import pandas as pd
import os
import json


# initialize blank list
applicantList = []

with os.scandir("applicants") as fileList:
    for entry in fileList:
        if entry.name.endswith(".json") and entry.is_file():
            with open(entry.path) as f:
                # parse the file as JSON and append the resulting record to the list
                applicantList.append(json.load(f))

# convert the list of records to a DataFrame
applicantDataFrame = pd.DataFrame(applicantList)

# what kind of uniqueness am I facing?
applicantDataFrame.nunique()

id                  5000
first_name          3154
last_name           3852
email               4000
years_experience      20
latitude            3736
longitude           3725
python_years          15
pandas_years           5
us_citizen             2
job_applied_for      195
highest_ed             5
date_applied         397
dtype: int64

Ah, so I have 195 unique jobs that this group has applied for - interesting
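To make the reading of `nunique()` concrete, here is a minimal sketch on a toy frame (the values are made up, not taken from the applicant files). A column with fewer distinct values than rows holds repeats, which is why 4,000 emails against 5,000 ids suggests that some people applied more than once.

```python
import pandas as pd

# Toy data standing in for the applicant records (values are made up).
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "email": ["a@example.com", "b@example.com", "a@example.com", "c@example.com"],
        "job_applied_for": ["Developer I", "Developer I", "Accountant III", "Developer I"],
    }
)

# nunique() counts the distinct values in each column.
toy_counts = toy.nunique()
print(toy_counts)
```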

## Peel off the Jobs

Let's start there and peel off the unique jobs into a data frame

In [2]:

# select the job column, drop repeats, and copy so later column assignments are safe
uniqueJobs = applicantDataFrame[['job_applied_for']].drop_duplicates().copy()

uniqueJobs



,job_applied_for
0,Physical Therapy Assistant
1,Administrative Officer
2,Assistant Media Planner
4,Human Resources Manager
5,Payment Adjustment Coordinator
...,...
2011,Office Assistant IV
2153,Budget/Accounting Analyst II
2607,Developer I
2651,Accountant III


In [3]:
# Now, let's add a key to each row
uniqueJobs['jobId'] = range(1, 1+len(uniqueJobs))

# set the index to be our custom column, but don't drop the jobId column either
uniqueJobs = uniqueJobs.set_index(['jobId'], drop = False)

uniqueJobs

jobId,job_applied_for,jobId
1,Physical Therapy Assistant,1
2,Administrative Officer,2
3,Assistant Media Planner,3
4,Human Resources Manager,4
5,Payment Adjustment Coordinator,5
...,...,...
191,Office Assistant IV,191
192,Budget/Accounting Analyst II,192
193,Developer I,193
194,Accountant III,194


Cool - we have 195 rows and a new jobId to use.

## Peel Off the Applicants

Now, let's get unique people, judged by all the people-related fields together. Do this just like the jobs.

In [11]:
# Put code here to get the unique applicants

# Then add the unique key to each row

# Then re-index (set the index to the unique key)
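One possible way to fill in the steps above, mirroring the jobs cell, is sketched below on toy data. The list `personColumns` is an assumption about which fields count as "people-related"; with the real files it would probably also include columns such as `years_experience` or `highest_ed`. The names `uniqueApplicants` and `applicantId` match how the later cells refer to this frame.

```python
import pandas as pd

# Toy stand-in for applicantDataFrame (values are made up).
applicantDataFrame = pd.DataFrame(
    {
        "first_name": ["Ann", "Bob", "Ann"],
        "last_name": ["Lee", "Ray", "Lee"],
        "email": ["a@example.com", "b@example.com", "a@example.com"],
        "job_applied_for": ["Developer I", "Developer I", "Accountant III"],
    }
)

# Assumption: these columns describe the person rather than the application.
personColumns = ["first_name", "last_name", "email"]

# Keep one row per distinct combination of the person columns.
uniqueApplicants = applicantDataFrame[personColumns].drop_duplicates().copy()

# Add a sequential key to each row.
uniqueApplicants["applicantId"] = range(1, 1 + len(uniqueApplicants))

# Index by the key, keeping applicantId as a regular column too.
uniqueApplicants = uniqueApplicants.set_index("applicantId", drop=False)

print(uniqueApplicants)
```

If a person's details differ between two applications (for example, a changed `years_experience`), deduplicating on all person columns keeps both rows, so it is worth checking that `email` is still unique afterwards.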


## Correlate Jobs to Applicants

### Don't Do it This Way

One thing we could do to build a crosswalk would be the brute force method: iterate through the raw data, look up the jobId and applicantId for each of its 5,000 rows, then add a new row each time to our clean crosswalk table of applications.

This is not a great idea, but it might be intuitive to you, so the example is provided:

In [8]:
# Let's build our crosswalk

applicationsCrosswalk = pd.DataFrame(columns = ["applicantId","jobId"])

# iterate through the raw data from the files

for index, row in applicantDataFrame.iterrows():
    # find the row in our unique jobs frame that has this name
    matchingJobRow = uniqueJobs.loc[uniqueJobs['job_applied_for'] == row['job_applied_for']]
    # find the row in our unique people frame that has this email
    matchingPersonRow = uniqueApplicants.loc[uniqueApplicants['email'] == row['email']]
    # add a row to the applications crosswalk that correlates the job with the person
    applicationsCrosswalk.loc[len(applicationsCrosswalk.index)] = [matchingPersonRow['applicantId'].iloc[0],matchingJobRow['jobId'].iloc[0]] 
    
applicationsCrosswalk


,applicantId,jobId
0,1,1
1,2,2
2,3,3
3,4,2
4,5,4
...,...,...
4995,3997,126
4996,3998,168
4997,3228,86
4998,3999,2


### A Better Way to Make the Connection

A better way to do this would be to use merges on the original raw data to insert the keys of our jobs and applicants.

In [9]:
# merge the original raw dataframe with the unique job dataframe, matching by job applied for (title)
applicationCrossWalk2 = applicantDataFrame.merge(uniqueJobs,on="job_applied_for",how="left")

# merge in the applicants, matching by email

# reduce down to just the fields we need for our crosswalk



,applicantId,jobId
0,1,1
1,2,2
2,3,3
3,4,2
4,5,4
...,...,...
4995,3997,126
4996,3998,168
4997,3228,86
4998,3999,2
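The merge approach can be sketched end to end on toy data as follows. The three small frames stand in for the real ones and their values are made up. Passing `validate="many_to_one"` is an optional safeguard, not something the original cell uses: it asks pandas to raise an error if a key is repeated in the lookup frame.

```python
import pandas as pd

# Toy stand-ins for the raw data and the two lookup frames (values are made up).
applicantDataFrame = pd.DataFrame(
    {
        "email": ["a@example.com", "b@example.com", "a@example.com"],
        "job_applied_for": ["Developer I", "Developer I", "Accountant III"],
    }
)
uniqueJobs = pd.DataFrame(
    {"job_applied_for": ["Developer I", "Accountant III"], "jobId": [1, 2]}
)
uniqueApplicants = pd.DataFrame(
    {"email": ["a@example.com", "b@example.com"], "applicantId": [1, 2]}
)

# Attach the job key to every raw row, matching by job title.
merged = applicantDataFrame.merge(
    uniqueJobs, on="job_applied_for", how="left", validate="many_to_one"
)

# Attach the applicant key, matching by email; bring over only the key columns.
merged = merged.merge(
    uniqueApplicants[["email", "applicantId"]],
    on="email",
    how="left",
    validate="many_to_one",
)

# Reduce to just the two key columns for the crosswalk.
applicationsCrosswalk = merged[["applicantId", "jobId"]]
print(applicationsCrosswalk)
```

Each merge is a single vectorized operation over the whole frame, whereas the loop performs two filtered lookups and one row append per raw row.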


We now have three data frames - unique applicants, unique jobs, and the correlation between the two, so let's save them as .csv files.

## Output the Resulting Files

In [10]:
# Save the Unique Applicants
uniqueApplicants.to_csv(r'~/Desktop/applicants.csv', index=False, header=True)

# Save the Unique Jobs

# Save the Applications
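A minimal sketch of writing all three frames is shown below, using toy frames and a temporary folder so it can run anywhere. The file names `jobs.csv` and `applications.csv` and the `outputDir` variable are hypothetical; the notebook itself only names `applicants.csv` on the Desktop.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the three frames built above (values are made up).
uniqueApplicants = pd.DataFrame(
    {"email": ["a@example.com", "b@example.com"], "applicantId": [1, 2]}
)
uniqueJobs = pd.DataFrame(
    {"job_applied_for": ["Developer I", "Accountant III"], "jobId": [1, 2]}
)
applicationsCrosswalk = pd.DataFrame({"applicantId": [1, 2, 1], "jobId": [1, 1, 2]})

# Hypothetical output folder; swap in the folder you actually want to use.
outputDir = Path(tempfile.mkdtemp())

# index=False leaves the DataFrame index out of the file, since each frame
# already carries its key as a regular column.
uniqueApplicants.to_csv(outputDir / "applicants.csv", index=False)
uniqueJobs.to_csv(outputDir / "jobs.csv", index=False)
applicationsCrosswalk.to_csv(outputDir / "applications.csv", index=False)
```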
