# Function: csv_columnNames_to_rows

This function takes a folder of csv files and builds a new data frame in which each column is one file and each row holds one of that file's column names, so the names can be compared from file to file.  

The goal is to make it easier to adjust column names before merging csv files that hold similar data but name their columns differently from file to file (for example, from year to year).   

### Directions:
 Ensure the folder is in the same directory as your notebook    
 Pass in the folder name as the function argument  
 The function requires the libraries: *pandas* and *os*  
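
Before calling the function, it can help to confirm that the folder is visible from the notebook's working directory and holds only csv files, because the import step asserts that every entry ends in `.csv`. The following is a minimal sketch of such a check; the helper name `list_csv_files` is hypothetical and not part of the original notebook.

```python
import os

def list_csv_files(dirname):
    """Hypothetical helper: split a folder's entries into csv and non-csv names."""
    if not os.path.isdir(dirname):
        # The folder must sit next to the notebook (or be given as a valid path)
        raise FileNotFoundError(f'{dirname!r} is not a folder in {os.getcwd()!r}')
    entries = sorted(os.listdir(dirname))
    csv_files = [e for e in entries if e.endswith('.csv')]
    others = [e for e in entries if not e.endswith('.csv')]
    return csv_files, others
```

Calling `list_csv_files('COBRA-Data')` would return the csv names along with any stray entries that should be moved out of the folder first.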

In [1]:
# Import libraries
import pandas as pd
import os

# Specify folder name
dirname = 'COBRA-Data' # This is the example folder I used

In [2]:
# Step 1) import the csv files
def import_csv_lst(dirname):
    
    files = os.listdir(dirname)
    print('Importing:', files, '\n...')
    dfs = []
    
    for name in files:
        assert name.endswith('.csv'), 'Error: all files must be csv'
        # Build the path from dirname so any folder works, not only 'COBRA-Data'
        file = os.path.join(dirname, name)
        print('loading:', file)
        df = pd.read_csv(file, low_memory=False)
        dfs.append(df)

    if len(dfs) == len(files):
        print('... \nSuccess')
    else:
        print('... \nerror')
    
    return (dfs, files)



# Step 2) Create a dictionary of {file_name: column.names} by passing in the output of step 1 
def csv_and_cols_to_dict(output):
    dfs, files = output
    # Pair each file name with the list of column names from its data frame
    return {name: list(df.columns) for name, df in zip(files, dfs)}



# Step 3) Returns max length of a value (column.names) for any key (file_name) in dictionary
def max_dict_len(d):
    max_len = 0
    for k in d:
        if len(d[k]) > max_len:
            max_len = len(d[k])
    return max_len




# Step 4) Adjusts lengths of values (column.names) to allow for data frame creation
def len_adjust(d):
    max_len = max_dict_len(d) # retrieves max len
    for i in d:
        # Pad shorter lists with the placeholder 'XXX' until every list has max_len entries
        dif = max_len - len(d[i])
        d[i].extend(['XXX'] * dif)
    return d



# Step 5) Builds the length-adjusted dictionary {file_name: column.names} from the output of step 1 (rows=column.names, columns=file_name once converted in step 6)
def dfNameToCol_dfColToRow(csv_colunmNames_to_rows):
    return len_adjust(csv_and_cols_to_dict(csv_colunmNames_to_rows)) # returns adjusted dictionary 



# Step 6) Converts a dictionary to a data frame
def dfsLst_to_rowsOfColumns(csv_colunmNames_to_rows):
    return pd.DataFrame.from_dict(dfNameToCol_dfColToRow(csv_colunmNames_to_rows))



# Step 7) Tie it all together
def csv_colunmNames_to_rows(dirname):    
    return dfsLst_to_rowsOfColumns(import_csv_lst(dirname))
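
The padding in step 4 is needed because pandas cannot build a data frame from lists of unequal length. As a minimal sketch of that point, the example below uses made-up column lists (the file names `a.csv` and `b.csv` are hypothetical) and shows a swapped-in alternative to the `'XXX'` placeholder: wrapping each list in a `pd.Series`, which lets pandas fill the short columns with `NaN`.

```python
import pandas as pd

# Hypothetical column lists standing in for two csv files of different widths
cols = {
    'a.csv': ['id', 'date', 'beat'],
    'b.csv': ['id', 'date'],
}

# Lists of unequal length cannot be turned into a data frame directly
try:
    pd.DataFrame.from_dict(cols)
    unequal_ok = True
except ValueError:
    unequal_ok = False

# Wrapping each list in a Series lets pandas pad the short columns with NaN
table = pd.DataFrame({name: pd.Series(names) for name, names in cols.items()})
print(table)
```

Either form of padding gives the same layout; the `'XXX'` placeholder is easier to spot by eye, while `NaN` works with pandas' missing-value tools such as `isna()`.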

# Testing function 
The test data is from the Atlanta Police Department's crime statistics.  I used the five files in the raw crime data download section.  
https://www.atlantapd.org/i-want-to/crime-data-downloads  


In [4]:
csv_colunmNames_to_rows(dirname)

Importing: ['COBRA-2009-2019.csv', 'COBRA-2020(NEW RMS 9-30 12-31).csv', 'COBRA-2020-OldRMS-09292020.csv', 'COBRA-2021.csv', 'COBRA-2022.csv'] 
...
loading: COBRA-Data/COBRA-2009-2019.csv
loading: COBRA-Data/COBRA-2020(NEW RMS 9-30 12-31).csv
loading: COBRA-Data/COBRA-2020-OldRMS-09292020.csv
loading: COBRA-Data/COBRA-2021.csv
loading: COBRA-Data/COBRA-2022.csv
... 
Success


,COBRA-2009-2019.csv,COBRA-2020(NEW RMS 9-30 12-31).csv,COBRA-2020-OldRMS-09292020.csv,COBRA-2021.csv,COBRA-2022.csv
0,Report Number,offense_id,offense_id,offense_id,offense_id
1,Report Date,rpt_date,rpt_date,rpt_date,rpt_date
2,Occur Date,occur_date,occur_date,occur_date,occur_date
3,Occur Time,occur_time,occur_time,occur_day,occur_day
4,Possible Date,poss_date,poss_date,occur_day_num,occur_day_num
5,Possible Time,poss_time,poss_time,occur_time,occur_time
6,Beat,beat,beat,poss_date,poss_date
7,Apartment Office Prefix,apt_office_prefix,apartment_office_prefix,poss_time,poss_time
8,Apartment Number,apt_office_num,apartment_number,beat,beat
9,Location,location,location,zone,zone


Now it is easier to compare column names and make adjustments before exploring the data further!
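
One possible next step is to rename the older column names to the newer ones and then stack the files. The sketch below uses toy frames with made-up values, and the mapping in `rename_map` is an assumption read off the first rows of the comparison table; it should be checked against the data before use.

```python
import pandas as pd

# Toy frames standing in for two yearly files (values are made up)
old = pd.DataFrame({'Report Number': [1, 2],
                    'Report Date': ['2019-01-01', '2019-01-02']})
new = pd.DataFrame({'offense_id': [3],
                    'rpt_date': ['2021-01-01']})

# Assumed mapping from old-style names to new-style names
rename_map = {'Report Number': 'offense_id', 'Report Date': 'rpt_date'}

# Rename the old columns, then stack the frames into one
old_renamed = old.rename(columns=rename_map)
merged = pd.concat([old_renamed, new], ignore_index=True)
print(merged)
```

Because `pd.concat` aligns on column names rather than position, the differing column order seen in the 2021 and 2022 files would not matter once the names agree.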