# Table of Contents

After [setting up](#Set-up) the Python environment, we are going to:
- read the source file into memory
- determine the variable on which we will split files
- create an ordered list of values for that variable and save it as an object
- repeat the following for each chunk until all chunks have been processed
    - overwrite the df with the slice that represents a chunk
    - add headers to the file
    - create a name for the file and save as an object
    - write the resulting file out to a folder with that name

# Set up

In [1]:
import pandas as pd
import numpy as np
import io, glob

# Initial Reading of Data (1st Lambda)

[back to top](#Table-of-Contents)

The data will flow through three folders:
- the input folder:
    This folder holds the large files that need to be chunked.
- the intermediate folder:
    This folder holds intermediate files, each containing data associated with an individual booklet. A file here may not contain all events for a particular student, depending on the value of the `chunksize` argument supplied to `pandas.read_csv` and the size of the data associated with that student.
- the output folder:
    This is the folder where the final chunks will be placed. Each file here contains all of the events associated with the student indicated in the filename.
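
To illustrate why an intermediate file may hold only part of a student's events, here is a minimal sketch of reading a file in pieces with the `chunksize` argument of `pandas.read_csv`. The toy data, the `EventTime` column, and the chunk size of 3 are assumptions for illustration; only `BookletNumber` comes from this notebook.

```python
import io
import pandas as pd

# Toy stand-in for a large source file (hypothetical values).
raw = io.StringIO(
    "BookletNumber,EventTime\n"
    "101,1\n101,2\n102,3\n102,4\n102,5\n103,6\n"
)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
pieces = {}
for i, piece in enumerate(pd.read_csv(raw, chunksize=3)):
    for booklet, rows in piece.groupby("BookletNumber"):
        # Record which read-chunks contain rows for each booklet.
        pieces.setdefault(int(booklet), []).append(i)

print(pieces)
```

In this toy run booklet 102 appears in two read-chunks, so a file written from either chunk alone would be missing some of that booklet's events.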

In [14]:
outputFolder = 'C:/Users/cagard/OneDrive - Educational Testing Service/Testing/'

In [5]:
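####################################################
# This function returns the chunk a student belongs to, identified by
# the smallest chunk upper bound greater than the student number.
# x = the assigned student number (0-based)
# upperBounds = list of chunk upper bounds; defaults to the global
#     `chunks` list, looked up when the function is called
####################################################
def assignChunk(x, upperBounds=None):
    if upperBounds is None:
        upperBounds = chunks
    return min(c for c in upperBounds if x < c)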
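
As a quick check of what `assignChunk` returns, the sketch below applies the same logic to a toy list of upper bounds and compares it with plain integer arithmetic. The bounds `[1000, 2000, 3000]` and the names `assign_chunk` and `assign_chunk_arith` are assumptions for illustration, not part of the notebook.

```python
def assign_chunk(x, bounds):
    # Smallest upper bound strictly greater than the 0-based student index.
    return min(c for c in bounds if x < c)

def assign_chunk_arith(x, chunksize):
    # Same result without scanning the list (hypothetical alternative).
    return (x // chunksize + 1) * chunksize

bounds = [1000, 2000, 3000]
for student in (0, 999, 1000, 2999):
    assert assign_chunk(student, bounds) == assign_chunk_arith(student, 1000)

print(assign_chunk(999, bounds), assign_chunk(1000, bounds))  # 1000 2000
```

The arithmetic form avoids scanning the whole list of bounds for every student, which may matter with several hundred thousand booklets.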

The cell below reads a single source file and creates the reference dataframe `refdf`, which maps each unique `BookletNumber` to a sequential `Student` number.

In [2]:
df = pd.read_csv('C:/Users/cagard/Downloads/ST4/2017Math_1717MA4D01GXXX02EX_obs_tab_delimited_headers_CA.csv')

refdf = pd.DataFrame(df.BookletNumber.unique())
refdf = refdf.reset_index().rename(columns={'index':'Student', 0:'BookletNumber'})
print("There are %d unique Booklets in the source file."%refdf.BookletNumber.nunique())

There are 358228 unique Booklets in the source file.
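
The same construction on a toy frame shows what `refdf` contains. The booklet values are made up; the column names follow the cell above.

```python
import pandas as pd

toy = pd.DataFrame({"BookletNumber": [7, 7, 3, 9, 3]})

# unique() keeps first-appearance order (it does not sort), so Student is a
# 0-based index in the order booklets first occur in the file.
toy_ref = pd.DataFrame(toy.BookletNumber.unique())
toy_ref = toy_ref.reset_index().rename(columns={"index": "Student", 0: "BookletNumber"})

print(toy_ref)
```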


# Chunking Data (also 1st Lambda)
Here we will try to save chunks. For each chunk in the file, we will:
- identify the list of booklets in that chunk and save all data for each booklet to its own file.

In [6]:
####################################################
# Set the number of booklets per chunk. This is the same for either type of file.
####################################################
chunksize = 1000
# Student is 0-based, so the number of students is max() + 1
nchunks = int(np.ceil((refdf['Student'].max() + 1) / chunksize))
print('number of chunks',nchunks)

chunks = [(c+1) *chunksize for c in range(nchunks)]
chunks
####################################################
# take the 'Student' column and pass it to the function
# assignChunk() and assign the response to a new
# column in refdf.
####################################################
refdf['chunk'] = refdf['Student'].apply(assignChunk)

number of chunks 359


In [10]:
####################################################
# Code improvement: creating the BlockCode column this way works for the
# current format but must be improved for other formats (standard file format).
# Ideally we should do this with a regular expression.
####################################################

# refdf['BlockCode'] = refdf.File[0].split('_')[-3].split('.csv')[0]
refdf['BlockCode'] = df.BlockCode[0]

# These assignments apply to every row at once, so no loop over chunks is needed.
# startEnd is the 1-based range of students in the chunk, e.g. '1-1000'
refdf['startEnd'] = refdf.chunk.apply(lambda x: str(x-chunksize+1)+'-'+str(x))
refdf['fileType'] = "raw"
refdf['outfile'] = refdf.fileType + "_" + refdf.BlockCode.astype(str)+'_'+refdf.startEnd.astype(str)+'.csv'
print('done')

# Single file
# print(df.shape)

df = df.merge(refdf, how="left", on=['BookletNumber', 'BlockCode'])
# print(df.shape)
# df.head()

# single file: write each group of rows to the file named in its outfile column
for outfile, group in df.groupby('outfile'):
    group.to_csv(outputFolder + outfile, index=False)

done
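
The comment in the cell above suggests deriving `BlockCode` with a regular expression. A minimal sketch follows, assuming the block code is the second underscore-separated token of the source filename. That layout, the helper name `block_code_from_filename`, and the pattern itself are assumptions and should be checked against the standard file format.

```python
import os
import re

# Assumed layout: <prefix>_<BlockCode>_<rest>.csv (not confirmed by the notebook).
BLOCK_CODE_PATTERN = re.compile(r"^[^_]+_([^_]+)_")

def block_code_from_filename(path):
    """Return the second underscore-separated token, or None if absent."""
    match = BLOCK_CODE_PATTERN.match(os.path.basename(path))
    return match.group(1) if match else None

name = "2017Math_1717MA4D01GXXX02EX_obs_tab_delimited_headers_CA.csv"
print(block_code_from_filename(name))
```

Returning `None` for a non-matching name lets the caller decide whether to fall back to `df.BlockCode[0]` or to stop.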


# Checking outfiles
We will read in the files that were just generated by the code above and check the distribution of students and columns across them.

In [16]:
outFileList = glob.glob(outputFolder+"raw_1717*.csv")
print(len(outFileList))

359


In [29]:
####################################################
# Note: No more than one file can have fewer Students than the chunk size.
#     No file should have more Students than the chunk size.
#     All files should have the same number of columns.
# This will show the number of students as well as the shape (columns)
####################################################
# df = pd.read_csv(outFileList[0])
# print(df.BookletNumber.nunique(), df.shape[1])
# df.head()
colList = []
# bookNumList
for file in outFileList:
    tempCols = pd.read_csv(file,nrows=5).columns
    colList.append(len(tempCols))
#     tempBookletNumbers = pd.read_csv(file,usecols = ['BookletNumber']).nunique()
    
#     print('File {}: Number of Students (Booklet Numbers) {}\n Number of columns {} (columns names: {})\n'\
#           .format(file, tempBookletNumbers, len(tempCols), tempCols))
list(set(colList))
# uniqueColList = [subList for subList in colList if len([sL for sL in colList if outFileList>1])!>1]
# if len(uniqueColList)>0:
#     print("The following sets of columns are unique among the processed files")
          

[15]
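
The student-count rules stated in the comment block of the last cell can also be checked programmatically. This is a sketch on toy files written to a temporary folder. The helper name `check_chunk_files` and the toy data are assumptions; on real output it would be called with `outFileList` and `chunksize` from the cells above.

```python
import glob
import os
import tempfile

import pandas as pd

def check_chunk_files(files, chunksize):
    """Return (undersized_files, oversized_files, column_counts)."""
    under, over, ncols = [], [], set()
    for path in files:
        # Read only the booklet column to count students cheaply.
        n = pd.read_csv(path, usecols=["BookletNumber"])["BookletNumber"].nunique()
        # nrows=0 reads just the header row.
        ncols.add(len(pd.read_csv(path, nrows=0).columns))
        if n < chunksize:
            under.append(path)
        elif n > chunksize:
            over.append(path)
    return under, over, ncols

with tempfile.TemporaryDirectory() as tmp:
    # Two toy chunk files: one full (2 booklets), one partial (1 booklet).
    pd.DataFrame({"BookletNumber": [1, 1, 2], "v": [0, 1, 2]}).to_csv(
        os.path.join(tmp, "raw_X_1-2.csv"), index=False)
    pd.DataFrame({"BookletNumber": [3], "v": [0]}).to_csv(
        os.path.join(tmp, "raw_X_3-4.csv"), index=False)
    under, over, ncols = check_chunk_files(
        sorted(glob.glob(os.path.join(tmp, "raw_*.csv"))), chunksize=2)

# At most one undersized file, no oversized files, one distinct column count.
assert len(under) <= 1 and not over and len(ncols) == 1
print(len(under), len(over), ncols)
```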