Script: Step 2 - Load File to SQL Server with Data Types.ipynb

Created by: Joshua Wilshere

Created on: 6/24/2025

Purpose: Load a csv or Excel file to SQL Server with manually set data types, profile it, and compare with the AllChar table loaded in Step 1

Python version used for testing: 3.12

<u>Dependencies</u><br>
1. '.env' file in this same directory with the following environment variables populated:
    - SQLServerDomain
    - SQLServerHost
    - SQLServerDatabase

2. Scripts: 
    - ..\Data Profiling\Create Stored Procedure meta.ProfileSingleTable.sql
    - ..\Data Profiling\Create Stored Procedure meta.CompareCharAndTypedTableProfiles.sql

<u>Basic Steps To Load and Profile Data:</u><br>

1. Load data as characters with Step 1 - Load File to SQL Server as Characters.ipynb<br>
    a. This script also creates the data profiling stored procedure and populates meta.TABLE_PROFILES with the profile of the _AllChar table
2. Load data with specified data types with Step 2 - Load File to SQL Server with Data Types.ipynb<br>
    a. This notebook populates meta.TABLE_PROFILES with the profile of the data-typed table and runs the comparison query to compare the AllChar and typed profiles

In [1]:
import os
from datetime import datetime as dttm
from datetime import date as dt
from datetime import timezone

from dotenv import load_dotenv
import numpy as np
import pandas as pd
import pyodbc
import sqlalchemy as sa
from sqlalchemy.engine import URL
from sqlalchemy import create_engine
# openpyxl only needed for Excel files
import openpyxl

In [2]:
# Load the environment variables from the .env file
load_dotenv(override=True)

True

In [3]:
############# Change filePath ######################
# Local file path format
# filePath = r'C:\Users\<username>\path\to\file\sampleData'
filePath = r'sampleData'

# Remote file path format 
# Note - due to VPN double hop it's not recommended to load larger files from a network drive to SQL Server
#filePath = r'//<shareservername>/<driveletter>/SampleData/'

fileName = r'green_tripdata_2020-04.xlsx'

fullPath = os.path.join(filePath, fileName)

Use the code below to load data from CSV. If the data is in Excel, comment out (CTRL+K CTRL+C) or remove this block and skip to the next two code blocks.

In [4]:
# # Load CSV file into Pandas data frame
# df = pd.read_csv(fullPath)

# # It is okay to ignore the following DtypeWarning - it defaults those columns to object datatype which is what we'll do to all the columns anyway
# # DtypeWarning: Columns (....) have mixed types. Specify dtype option on import or set low_memory=False.

Uncomment (CTRL+K CTRL+U) and use the following 2 code blocks to load data from Excel. 

In [5]:
# Load Excel file into Pandas
df_ExcelFile = pd.ExcelFile(fullPath)

# Get Excel Sheet name and assign to variable
# Increment the index # based on the sheet you want to load. 0 is first sheet
sheetName = df_ExcelFile.sheet_names[0]
# Print the sheet name to confirm choice
print(sheetName)

green_tripdata_2020-04


In [6]:
# Load specified Excel sheet into Pandas dataframe
df = df_ExcelFile.parse(sheetName)

# Use below if the header row is not in the first row:
#df = df_ExcelFile.parse(sheetName,skiprows=4)

In [7]:
# Set Schema and Table Name for Target SQL Server Table
schema_name = 'dbo'
base_table_name = 'green_tripdata_2020-04_Excel'
table_name = '['+schema_name+'].['+base_table_name+']'
print(table_name)

[dbo].[green_tripdata_2020-04_Excel]
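The bracketed name above is built by plain string concatenation, which works as long as neither part contains a closing bracket. The sketch below shows one way to make the quoting defensive by doubling any `]` inside a name, which is how SQL Server escapes bracketed identifiers. The helper names `quote_identifier` and `qualified_table_name` are hypothetical and are not part of this notebook.

```python
def quote_identifier(name: str) -> str:
    """Wrap a SQL Server identifier in brackets, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def qualified_table_name(schema: str, table: str) -> str:
    """Build a two-part [schema].[table] name from unquoted parts."""
    return f"{quote_identifier(schema)}.{quote_identifier(table)}"


# Same inputs as the cell above (hypothetical helper, same resulting name)
print(qualified_table_name("dbo", "green_tripdata_2020-04_Excel"))
```

For ordinary names the result is identical to the concatenation used in the notebook; the difference only shows up for names that contain `]`.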


In [8]:
# Set the name of the previously loaded table with all character columns for use in the table profile query
allchar_table_name = base_table_name + '_AllChar'

In [9]:
# Set Database Connection Properties
SQLServerHost = os.getenv('SQLServerHost')
SQLServerDomain = os.getenv('SQLServerDomain')
SQLServerDatabase = os.getenv('SQLServerDatabase')

In [10]:
# Initialize SQL Server Connection
Driver='{ODBC Driver 17 for SQL Server}'
Server=f'tcp:{SQLServerHost}.{SQLServerDomain}'
Database=f'{SQLServerDatabase}'
MARS_Connection='Yes'
Trusted_Connection='yes'

connection_str = 'DRIVER='+ Driver +';SERVER='+Server+';DATABASE='+Database+';MARS_Connection='+MARS_Connection+';Trusted_Connection='+ Trusted_Connection
mssql_conn = pyodbc.connect(connection_str)

# Create a SQLAlchemy engine from the same connection string to load SQL Server data directly into a pandas dataframe
connection_url = URL.create('mssql+pyodbc', query={'odbc_connect': connection_str})
engine = create_engine(connection_url)
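If one of the three environment variables is missing, `os.getenv` returns `None` and the server name silently becomes something like `tcp:None.None`. The sketch below checks the settings first and then assembles the same connection string format used above. It reads from a plain dictionary instead of the real environment so it can run anywhere; `require_settings`, `build_connection_string`, and the sample host, domain, and database values are all hypothetical.

```python
def require_settings(env, names):
    """Return the named settings, raising if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise ValueError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in names}


def build_connection_string(host, domain, database,
                            driver="{ODBC Driver 17 for SQL Server}"):
    """Assemble the ODBC connection string in the same key order as the cell above."""
    parts = {
        "DRIVER": driver,
        "SERVER": f"tcp:{host}.{domain}",
        "DATABASE": database,
        "MARS_Connection": "Yes",
        "Trusted_Connection": "yes",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


# Placeholder values standing in for the .env file contents
example_env = {
    "SQLServerHost": "myhost",
    "SQLServerDomain": "example.org",
    "SQLServerDatabase": "Sandbox",
}
settings = require_settings(
    example_env, ["SQLServerHost", "SQLServerDomain", "SQLServerDatabase"]
)
print(build_connection_string(
    settings["SQLServerHost"],
    settings["SQLServerDomain"],
    settings["SQLServerDatabase"],
))
```

In the notebook itself, `os.environ` could be passed in place of `example_env` after `load_dotenv` has run.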

In [11]:
# Increase number of rows/columns data frames will display in a print command
# Reference: https://pandas.pydata.org/docs/user_guide/options.html
# Also recommended - enable output scrolling by opening VSCode User Settings, searching for notebook.output.scrolling and checking the box
pd.options.display.max_rows = 999
pd.options.display.max_columns = None

In [12]:
# Review pandas inferred datatypes and column names

print(df.dtypes)

VendorID                        float64
lpep_pickup_datetime     datetime64[ns]
lpep_dropoff_datetime            object
store_and_fwd_flag               object
RatecodeID                      float64
PULocationID                      int64
DOLocationID                      int64
passenger_count                 float64
trip_distance                   float64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
ehail_fee                       float64
improvement_surcharge           float64
total_amount                    float64
payment_type                    float64
trip_type                       float64
congestion_surcharge            float64
dtype: object


In [13]:
# Review only non-object datatypes and column names
print(df.dtypes[df.dtypes != 'object'])

VendorID                        float64
lpep_pickup_datetime     datetime64[ns]
RatecodeID                      float64
PULocationID                      int64
DOLocationID                      int64
passenger_count                 float64
trip_distance                   float64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
ehail_fee                       float64
improvement_surcharge           float64
total_amount                    float64
payment_type                    float64
trip_type                       float64
congestion_surcharge            float64
dtype: object


In [14]:
# Create dictionary of columns to force load them as 'str' data type (pandas reports these columns as 'object')
# Note - setting the datatype to 'str' instead of 'object' works better on Excel-based dataframes
var_dict = {}
for i in range(df.shape[1]):
    var_dict[df.columns[i]] = 'str'
print(var_dict)

{'VendorID': 'str', 'lpep_pickup_datetime': 'str', 'lpep_dropoff_datetime': 'str', 'store_and_fwd_flag': 'str', 'RatecodeID': 'str', 'PULocationID': 'str', 'DOLocationID': 'str', 'passenger_count': 'str', 'trip_distance': 'str', 'fare_amount': 'str', 'extra': 'str', 'mta_tax': 'str', 'tip_amount': 'str', 'tolls_amount': 'str', 'ehail_fee': 'str', 'improvement_surcharge': 'str', 'total_amount': 'str', 'payment_type': 'str', 'trip_type': 'str', 'congestion_surcharge': 'str'}
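The same mapping can be written as a dictionary comprehension. This is a minimal sketch that uses a short hard-coded list of column names in place of `df.columns` so it runs without the source file.

```python
# Stand-in for df.columns; in the notebook this would be the dataframe's column index
columns = ["VendorID", "lpep_pickup_datetime", "trip_distance"]

# Map every column name to the 'str' dtype
var_dict = {col: "str" for col in columns}
print(var_dict)
```

With the real dataframe this becomes `{col: 'str' for col in df.columns}`.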


Use the code below to load data from CSV. If the data is in Excel, comment out (CTRL+K CTRL+C) or remove this block and skip to the next code block.

In [15]:
# # Reload CSV file with all datatypes set to 'object'
# ############ Change filePath ######################
# df = pd.read_csv(fullPath, dtype=var_dict)

Uncomment (CTRL+K CTRL+U) and use the following code block to load data from Excel.

In [16]:
# Reload Excel file with all datatypes set to 'object'
############# Change filePath ######################
df = pd.read_excel(fullPath, sheet_name=sheetName, dtype=var_dict, engine='openpyxl')

# Use below if header row is not first row
#df = pd.read_excel(fullPath, sheet_name=sheetName, dtype=var_dict, engine='openpyxl', skiprows=4)

In [17]:
# Confirm that all dtypes are set to 'object'
# Expected output: Series([], dtype: object)
print(df.dtypes[df.dtypes != 'object'])

Series([], dtype: object)


In [18]:
# Set and confirm starting record count
rowcount1 = df.shape[0]
print('{} records'.format(rowcount1))

# Set and confirm starting column count
colcount1 = df.shape[1]
print('{} columns'.format(colcount1))

35612 records
20 columns


In [19]:
# Add source directory and filename audit columns to dataframe
df['AUD_DIRECTORY']=filePath
df['AUD_FILENAME']=fileName

In [20]:
# Construct the query of the meta.TABLE_PROFILES table to get the profile of the table with all character columns
allCharProfileQuery = f"SELECT * FROM meta.TABLE_PROFILES WHERE SCHEMA_NAME = \'{schema_name}\' AND TABLE_NAME = \'{allchar_table_name}\'"
print(allCharProfileQuery)

SELECT * FROM meta.TABLE_PROFILES WHERE SCHEMA_NAME = 'dbo' AND TABLE_NAME = 'green_tripdata_2020-04_Excel_AllChar'
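The query above embeds the schema and table names directly in the SQL text, which breaks if a name ever contains a single quote. A possible alternative is to bind the values as parameters. The sketch below demonstrates the idea with the standard library's `sqlite3` module as a stand-in for SQL Server, so the table has no `meta` schema and only three made-up columns and rows; the `?` parameter markers are the same style the INSERT statement in this notebook uses with pyodbc.

```python
import sqlite3

# In-memory stand-in for meta.TABLE_PROFILES (assumption: trimmed to three columns)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TABLE_PROFILES (SCHEMA_NAME TEXT, TABLE_NAME TEXT, COLUMN_NAME TEXT)"
)
conn.executemany(
    "INSERT INTO TABLE_PROFILES VALUES (?, ?, ?)",
    [
        ("dbo", "trips_AllChar", "[VENDORID]"),
        ("dbo", "rider's_log_AllChar", "[RIDER]"),
    ],
)

# Values are passed separately from the SQL text, so quotes need no escaping
profile_query = (
    "SELECT COLUMN_NAME FROM TABLE_PROFILES "
    "WHERE SCHEMA_NAME = ? AND TABLE_NAME = ? ORDER BY COLUMN_NAME"
)
rows = conn.execute(profile_query, ("dbo", "rider's_log_AllChar")).fetchall()
print(rows)
conn.close()
```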


In [21]:
# Read in table profile of all character table to use as a reference for setting data types below
with engine.begin() as conn:
    df_allchar_profile = pd.read_sql_query(sa.text(allCharProfileQuery), conn)
df_allchar_profile

Unnamed: 0,SCHEMA_NAME,TABLE_NAME,COLUMN_ID,COLUMN_NAME,DATATYPE,MAX_LEN,MAX_LEN_VALUE,MIN_LEN,NULL_COUNT,BLANK_COUNT,NUMERIC_VALUE_COUNT,MAX_PRECISION,MAX_SCALE,DISTINCT_COUNT,TOTAL_RECORD_COUNT
0,dbo,green_tripdata_2020-04_Excel_AllChar,1,[VENDORID],NVARCHAR(20),1,2,0,0,11110,24502.0,,,3,35612
1,dbo,green_tripdata_2020-04_Excel_AllChar,2,[LPEP_PICKUP_DATETIME],NVARCHAR(20),19,2020-04-01 00:44:02,19,0,0,0.0,,,32973,35612
2,dbo,green_tripdata_2020-04_Excel_AllChar,3,[LPEP_DROPOFF_DATETIME],NVARCHAR(20),19,2020-04-01 00:52:23,0,0,2,0.0,,,32982,35612
3,dbo,green_tripdata_2020-04_Excel_AllChar,4,[STORE_AND_FWD_FLAG],NVARCHAR(20),1,N,0,0,11110,0.0,,,3,35612
4,dbo,green_tripdata_2020-04_Excel_AllChar,5,[RATECODEID],NVARCHAR(20),1,1,0,0,11110,24502.0,,,7,35612
5,dbo,green_tripdata_2020-04_Excel_AllChar,6,[PULOCATIONID],NVARCHAR(20),3,244,1,0,0,35612.0,,,235,35612
6,dbo,green_tripdata_2020-04_Excel_AllChar,7,[DOLOCATIONID],NVARCHAR(20),3,247,1,0,0,35612.0,,,249,35612
7,dbo,green_tripdata_2020-04_Excel_AllChar,8,[PASSENGER_COUNT],NVARCHAR(20),1,1,0,0,11110,24502.0,,,10,35612
8,dbo,green_tripdata_2020-04_Excel_AllChar,9,[TRIP_DISTANCE],NVARCHAR(20),8,26471.58,0,0,1,35611.0,7.0,2.0,2569,35612
9,dbo,green_tripdata_2020-04_Excel_AllChar,10,[FARE_AMOUNT],NVARCHAR(20),6,-24.46,1,0,0,35612.0,4.0,2.0,4043,35612


In [22]:
# Review contents of dataframe and compare to source file.
df.head(10)

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge,AUD_DIRECTORY,AUD_FILENAME
0,2,2020-04-01 00:44:02,2020-04-01 00:52:23,N,1,42,41,1,1.68,8.0,0.5,0.5,0.0,0,,0.3,9.3,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx
1,2,2020-04-01 00:24:39,2020-04-01 00:33:06,N,1,244,247,2,1.94,9.0,0.5,0.5,0.0,0,,0.3,10.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
2,2,2020-04-01 00:45:06,,N,1,244,243,3,1.0,6.5,0.5,0.5,0.0,0,,0.3,7.8,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
3,2,2020-04-01 00:45:06,2020-04-01 01:04:39,N,1,244,243,2,,12.0,0.5,0.5,0.0,0,,0.3,13.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
4,2,2020-04-01 00:00:23,2020-04-01 00:16:13,N,1,75,169,1,6.79,21.0,0.5,0.5,0.0,0,,0.3,22.3,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx
5,2,2020-04-01 00:41:23,,N,1,41,233,1,5.04,16.5,0.5,0.5,0.0,0,,0.3,20.55,1,1,2.75,sampleData,green_tripdata_2020-04.xlsx
6,2,2020-04-01 00:00:52,2020-04-01 00:09:49,N,1,244,127,1,3.27,11.5,0.5,0.5,2.0,0,,0.3,14.8,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx
7,2,2020-04-01 00:51:04,2020-04-01 00:57:25,N,1,244,42,1,1.48,7.0,0.5,0.5,0.0,0,,0.3,8.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
8,2,2020-04-01 00:46:02,2020-04-01 01:05:52,N,4,244,265,1,14.41,42.5,0.5,0.5,,0,,0.3,46.8,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx
9,1,2020-04-01 00:29:05,2020-04-01 00:46:09,N,1,25,71,1,3.6,14.5,0.5,0.5,4.2,0,,0.3,20.0,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx


In [23]:
########### 
# NOTE: Before continuing it is strongly recommended to make any other necessary changes to the column structure of the dataframe.
# Any changes made after the below cell will require the "Col_list" to be regenerated before the INSERT statement is created further down.
###########

In [24]:
# Initialize column list and parameter list
Col_list = []
Param_list = [] # needed for insert into statement
#Get rid of invalid characters in the column names
for i in range(df.shape[1]):   
    # Replace spaces and most special characters with underscore in column names
    Col_Name = df.columns[i].strip().replace("  "," ").replace(" - ","_").replace("-", "_").replace(" ", "_").replace("&", "and").replace("(","").replace(")","").replace("/","_").replace("#","_NUM").replace(":","_").replace("__","_") 
    Col_Name = f'[{Col_Name.upper()}]' # Encase the column name in brackets to prevent issues with reserved words or columns starting with numbers
    # Populate column list with new column names
    Col_list.append(Col_Name)
    # For every column in the dataframe, add "?" to parameter list
    Param_list.append("?")
# Update the dataframe with the adjusted column names
df.columns = Col_list
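The chain of replacements above can be wrapped in a small function so it can be tried on individual names before it is applied to the dataframe. The sketch below applies the same replacements in the same order and adds a check for names that collide after cleaning, which the cell above does not do. `clean_column_name`, `find_duplicates`, and the sample names are hypothetical.

```python
from collections import Counter

# Same replacements, in the same order, as the cell above
REPLACEMENTS = [
    ("  ", " "), (" - ", "_"), ("-", "_"), (" ", "_"), ("&", "and"),
    ("(", ""), (")", ""), ("/", "_"), ("#", "_NUM"), (":", "_"), ("__", "_"),
]


def clean_column_name(raw: str) -> str:
    """Return an upper-cased, bracketed column name with special characters replaced."""
    name = raw.strip()
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return f"[{name.upper()}]"


def find_duplicates(names):
    """Return cleaned names that appear more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


# Two made-up source headers that clean to the same SQL column name
cleaned = [clean_column_name(c) for c in ["Trip-Distance", "Trip Distance", "VendorID"]]
print(cleaned)
print(find_duplicates(cleaned))
```

A non-empty duplicate list would mean the CREATE TABLE statement generated later defines the same column twice, so it is worth checking before continuing.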

In [25]:
# Validate contents and structure of dataframe
df.head(15)

Unnamed: 0,[VENDORID],[LPEP_PICKUP_DATETIME],[LPEP_DROPOFF_DATETIME],[STORE_AND_FWD_FLAG],[RATECODEID],[PULOCATIONID],[DOLOCATIONID],[PASSENGER_COUNT],[TRIP_DISTANCE],[FARE_AMOUNT],[EXTRA],[MTA_TAX],[TIP_AMOUNT],[TOLLS_AMOUNT],[EHAIL_FEE],[IMPROVEMENT_SURCHARGE],[TOTAL_AMOUNT],[PAYMENT_TYPE],[TRIP_TYPE],[CONGESTION_SURCHARGE],[AUD_DIRECTORY],[AUD_FILENAME]
0,2,2020-04-01 00:44:02,2020-04-01 00:52:23,N,1,42,41,1,1.68,8.0,0.5,0.5,0.0,0,,0.3,9.3,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx
1,2,2020-04-01 00:24:39,2020-04-01 00:33:06,N,1,244,247,2,1.94,9.0,0.5,0.5,0.0,0,,0.3,10.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
2,2,2020-04-01 00:45:06,,N,1,244,243,3,1.0,6.5,0.5,0.5,0.0,0,,0.3,7.8,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
3,2,2020-04-01 00:45:06,2020-04-01 01:04:39,N,1,244,243,2,,12.0,0.5,0.5,0.0,0,,0.3,13.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
4,2,2020-04-01 00:00:23,2020-04-01 00:16:13,N,1,75,169,1,6.79,21.0,0.5,0.5,0.0,0,,0.3,22.3,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx
5,2,2020-04-01 00:41:23,,N,1,41,233,1,5.04,16.5,0.5,0.5,0.0,0,,0.3,20.55,1,1,2.75,sampleData,green_tripdata_2020-04.xlsx
6,2,2020-04-01 00:00:52,2020-04-01 00:09:49,N,1,244,127,1,3.27,11.5,0.5,0.5,2.0,0,,0.3,14.8,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx
7,2,2020-04-01 00:51:04,2020-04-01 00:57:25,N,1,244,42,1,1.48,7.0,0.5,0.5,0.0,0,,0.3,8.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
8,2,2020-04-01 00:46:02,2020-04-01 01:05:52,N,4,244,265,1,14.41,42.5,0.5,0.5,,0,,0.3,46.8,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx
9,1,2020-04-01 00:29:05,2020-04-01 00:46:09,N,1,25,71,1,3.6,14.5,0.5,0.5,4.2,0,,0.3,20.0,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx


In [26]:
print(df.dtypes)

[VENDORID]                 object
[LPEP_PICKUP_DATETIME]     object
[LPEP_DROPOFF_DATETIME]    object
[STORE_AND_FWD_FLAG]       object
[RATECODEID]               object
[PULOCATIONID]             object
[DOLOCATIONID]             object
[PASSENGER_COUNT]          object
[TRIP_DISTANCE]            object
[FARE_AMOUNT]              object
[EXTRA]                    object
[MTA_TAX]                  object
[TIP_AMOUNT]               object
[TOLLS_AMOUNT]             object
[EHAIL_FEE]                object
[IMPROVEMENT_SURCHARGE]    object
[TOTAL_AMOUNT]             object
[PAYMENT_TYPE]             object
[TRIP_TYPE]                object
[CONGESTION_SURCHARGE]     object
[AUD_DIRECTORY]            object
[AUD_FILENAME]             object
dtype: object


In [27]:
# # Convert certain fields to specific datatypes
# # https://stackoverflow.com/questions/15891038/change-column-type-in-pandas
# # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

In [28]:
# Function to validate dates are in proper format and if so, return the date
# Input: string date in the format YYYY-MM-DD 
# Reference: https://stackoverflow.com/questions/16870663/how-do-i-validate-a-date-string-format-in-python
def validate(date_text):
    try:
        valid_date = dt.fromisoformat(date_text)
        return(valid_date)
    except ValueError:
        raise ValueError("Incorrect date format, should be YYYY-MM-DD")

In [29]:
# Generate potential list of date columns
# Outputs list of data frame columns that contain 'DATE' to review and use in next step
date_cols = [col for col in df.columns if 'DATE' in col]
date_cols

['[LPEP_PICKUP_DATETIME]', '[LPEP_DROPOFF_DATETIME]']

In [30]:
# Ensure values being passed to date columns are within pandas datetime64 bounds
# https://www.statology.org/pandas-out-of-bounds-nanosecond-timestamp/
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations 
# Minimum: 1677-09-21 00:12:43.145224193
# Maximum: 2262-04-11 23:47:16.854775807

######### Replace the column names in this list #################
column_names = ['[LPEP_PICKUP_DATETIME]', '[LPEP_DROPOFF_DATETIME]']

for n in column_names:
    for i in range(df.shape[0]):
        if not pd.isna(df[n].iloc[i]): # Skips NaN values, which have no date portion to check
            # Casts the value to a string, splits any date time value on the middle whitespace, and keeps the date portion
            datevar = str(df[n].iloc[i]).split(' ')[0]
            if (validate(datevar) > validate('2262-04-11')) or (validate(datevar) < validate('1677-09-21')):
                print(f'{df[n].iloc[i]} in column {n} row number {i} is out of bounds')
            


2320-04-01 00:48:09 in column [LPEP_DROPOFF_DATETIME] row number 12 is out of bounds
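The bounds check can also be expressed as a function that returns the offending rows instead of printing them, which makes the result easier to reuse. This is a minimal sketch using only the standard library and a plain list of strings; it assumes missing values arrive as `None` or an empty string rather than NaN, and it compares on the date portion only, as the loop above does. `find_out_of_bounds` and the sample values are hypothetical.

```python
from datetime import date

# Date portions of the pandas datetime64[ns] limits quoted in the cell above
PANDAS_MIN_DATE = date(1677, 9, 21)
PANDAS_MAX_DATE = date(2262, 4, 11)


def find_out_of_bounds(values):
    """Return (row_number, value) pairs whose date portion falls outside the bounds."""
    flagged = []
    for row_number, value in enumerate(values):
        if value is None or value == "":
            continue  # nothing to check for missing values
        # Keep only the YYYY-MM-DD portion before the first space
        parsed = date.fromisoformat(str(value).split(" ")[0])
        if parsed < PANDAS_MIN_DATE or parsed > PANDAS_MAX_DATE:
            flagged.append((row_number, value))
    return flagged


sample = ["2020-04-01 00:52:23", None, "2320-04-01 00:48:09"]
print(find_out_of_bounds(sample))
```

Because only the date is compared, a timestamp that falls on a boundary day but past the exact boundary time is not flagged here.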


In [31]:
# View full value at row number returned above
df['[LPEP_DROPOFF_DATETIME]'].iloc[12]

'2320-04-01 00:48:09'

In [32]:
# Convert multiple fields to a specific datatype - apply the change
# Set all date columns with all valid date values to the datetime64[ns] pandas datatype
column_names = ['[LPEP_PICKUP_DATETIME]'
#, '[LPEP_DROPOFF_DATETIME]'  # OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2320-04-01 00:48:09; row number 12; Pandas max allowable datetime = 2262-04-11 23:47:16.854775807
]
datetime_type = "datetime64[ns]"

for i in column_names:
    df = df.astype({i: datetime_type})

In [33]:
# View all columns formatted easily for copy/paste into additional data type blocks
for i in df.columns:
    print(",'"+i+"'")

,'[VENDORID]'
,'[LPEP_PICKUP_DATETIME]'
,'[LPEP_DROPOFF_DATETIME]'
,'[STORE_AND_FWD_FLAG]'
,'[RATECODEID]'
,'[PULOCATIONID]'
,'[DOLOCATIONID]'
,'[PASSENGER_COUNT]'
,'[TRIP_DISTANCE]'
,'[FARE_AMOUNT]'
,'[EXTRA]'
,'[MTA_TAX]'
,'[TIP_AMOUNT]'
,'[TOLLS_AMOUNT]'
,'[EHAIL_FEE]'
,'[IMPROVEMENT_SURCHARGE]'
,'[TOTAL_AMOUNT]'
,'[PAYMENT_TYPE]'
,'[TRIP_TYPE]'
,'[CONGESTION_SURCHARGE]'
,'[AUD_DIRECTORY]'
,'[AUD_FILENAME]'


In [34]:
# Convert multiple fields to a specific datatype - apply the change
# Set columns that should be decimals (like money columns) to float64

# Use knowledge of dataset and results of profiling to populate this column list

column_names = ['[TRIP_DISTANCE]'
,'[FARE_AMOUNT]'
,'[EXTRA]'
,'[MTA_TAX]'
,'[TIP_AMOUNT]'
,'[TOLLS_AMOUNT]'
,'[EHAIL_FEE]'
,'[IMPROVEMENT_SURCHARGE]'
,'[TOTAL_AMOUNT]'
,'[CONGESTION_SURCHARGE]']
num_type = "float64"

for i in column_names:
    df = df.astype({i: num_type})


In [35]:
# Convert multiple fields to a specific datatype - apply the change
# Set columns that should be int/bigint to Int64

# Use knowledge of dataset and results of profiling to populate this column list

column_names = ['[VENDORID]'
    ,'[RATECODEID]'
    ,'[PULOCATIONID]'
    ,'[DOLOCATIONID]'
    ,'[PASSENGER_COUNT]']
num_type = "Int64"

for i in column_names:
    df = df.astype({i: num_type})

In [36]:
# Validate structure and contents of dataframe
df.head(5)

Unnamed: 0,[VENDORID],[LPEP_PICKUP_DATETIME],[LPEP_DROPOFF_DATETIME],[STORE_AND_FWD_FLAG],[RATECODEID],[PULOCATIONID],[DOLOCATIONID],[PASSENGER_COUNT],[TRIP_DISTANCE],[FARE_AMOUNT],[EXTRA],[MTA_TAX],[TIP_AMOUNT],[TOLLS_AMOUNT],[EHAIL_FEE],[IMPROVEMENT_SURCHARGE],[TOTAL_AMOUNT],[PAYMENT_TYPE],[TRIP_TYPE],[CONGESTION_SURCHARGE],[AUD_DIRECTORY],[AUD_FILENAME]
0,2,2020-04-01 00:44:02,2020-04-01 00:52:23,N,1,42,41,1,1.68,8.0,0.5,0.5,0.0,0.0,,0.3,9.3,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx
1,2,2020-04-01 00:24:39,2020-04-01 00:33:06,N,1,244,247,2,1.94,9.0,0.5,0.5,0.0,0.0,,0.3,10.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
2,2,2020-04-01 00:45:06,,N,1,244,243,3,1.0,6.5,0.5,0.5,0.0,0.0,,0.3,7.8,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
3,2,2020-04-01 00:45:06,2020-04-01 01:04:39,N,1,244,243,2,,12.0,0.5,0.5,0.0,0.0,,0.3,13.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx
4,2,2020-04-01 00:00:23,2020-04-01 00:16:13,N,1,75,169,1,6.79,21.0,0.5,0.5,0.0,0.0,,0.3,22.3,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx


In [37]:
# Review non-object datatypes set above
print(df.dtypes[df.dtypes != 'object'])

[VENDORID]                          Int64
[LPEP_PICKUP_DATETIME]     datetime64[ns]
[RATECODEID]                        Int64
[PULOCATIONID]                      Int64
[DOLOCATIONID]                      Int64
[PASSENGER_COUNT]                   Int64
[TRIP_DISTANCE]                   float64
[FARE_AMOUNT]                     float64
[EXTRA]                           float64
[MTA_TAX]                         float64
[TIP_AMOUNT]                      float64
[TOLLS_AMOUNT]                    float64
[EHAIL_FEE]                       float64
[IMPROVEMENT_SURCHARGE]           float64
[TOTAL_AMOUNT]                    float64
[CONGESTION_SURCHARGE]            float64
dtype: object


In [38]:
# Generate the column definitions for a CREATE TABLE statement based on the final dataframe columns and data types
Text_list = []
for i in range(df.shape[1]):   
    Col_Name = df.columns[i]
    P_type = df.dtypes.iloc[i] # Get the datatype of the column
    Col_maxlen = df[Col_Name].astype(str).str.len().max() # Get the max length of the column's values when cast to strings
    # Set SQL datatype to FLOAT or DECIMAL(p,s) if pandas datatype is float64
    if str(P_type) == 'float64':
        #c_type = "FLOAT"        # Uncomment this if the columns set to float64 should be set to SQL FLOAT datatype instead of DECIMAL
        c_type = "DECIMAL(18,2)" # Custom set float64 datatypes to DECIMAL(18,2) - Adjust precision and scale based on data
    # Set SQL datatype to DATETIME2 where the pandas datatype is datetime64[ns]
    elif P_type == 'datetime64[ns]':
        c_type = 'DATETIME2'
    # Set SQL datatype to BIGINT if pandas datatype is Int64 (any 64-bit integer fits in BIGINT)
    elif P_type == 'Int64':
        c_type = "BIGINT"
    # Set remaining columns to NVARCHAR(n) where n is based on max length of the column
    else:
        len_str="256" # Initializes the len_str variable and sets to 256 by default
        if(Col_maxlen<20):
            len_str="20"
        elif(Col_maxlen<50):
            len_str="50"
        elif(Col_maxlen<256):
            len_str="256"
        elif(Col_maxlen<1024):
            len_str="1024"
        else:
            len_str="max"
        c_type = "NVARCHAR("+len_str+")"
    Text_list.append(Col_Name)
    Text_list.append(' ')
    Text_list.append(c_type)
    if i < (df.shape[1] - 1): 
        Text_list.append(",\n")
Text_Block = ''.join(Text_list)
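The type mapping in the loop above can be isolated into a function that takes a pandas dtype name and a maximum string length, which makes the rules easy to test without a dataframe. The sketch below follows the same thresholds; it treats every Int64 column as BIGINT, on the assumption that a 64-bit integer always fits. `sql_type_for` is a hypothetical helper name.

```python
def sql_type_for(dtype_name: str, max_len: int) -> str:
    """Map a pandas dtype name and max string length to a SQL Server column type."""
    if dtype_name == "float64":
        return "DECIMAL(18,2)"  # adjust precision and scale based on the data
    if dtype_name == "datetime64[ns]":
        return "DATETIME2"
    if dtype_name == "Int64":
        return "BIGINT"
    # Everything else becomes NVARCHAR(n), using the first size the values fit under
    for limit in (20, 50, 256, 1024):
        if max_len < limit:
            return f"NVARCHAR({limit})"
    return "NVARCHAR(max)"


# A 27-character file name lands in the NVARCHAR(50) bucket
print(sql_type_for("object", 27))
```

In the notebook, the first argument would come from `str(df.dtypes.iloc[i])` and the second from `Col_maxlen`.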

In [39]:
# Assembles and prints the CREATE TABLE statement with additional audit columns AUD_SEQ_ID and AUD_INSRT_TMSTP

print('-----------Create Table Statement------------')

drop_str = f'DROP TABLE IF EXISTS {table_name};\n\n'

create_str="CREATE TABLE " + table_name + "("

## Note: SUSER_SNAME() returns the login name of the user that is currently connected to SQL Server. Commented block returns full username, uncommented block returns
##    username without the domain prefix (e.g. DOMAIN\username becomes username)

# create_fullstr = drop_str + create_str + """
# {},\n[AUD_SEQ_ID] BIGINT PRIMARY KEY IDENTITY(1,1),\n[AUD_INSRT_TMSTP] DATETIME2 DEFAULT SYSDATETIME(),\n[AUD_INSRT_USER] NVARCHAR(200) DEFAULT SUSER_SNAME()\n);
# """.format(Text_Block) +""

create_fullstr = drop_str + create_str + """
{},\n[AUD_SEQ_ID] BIGINT PRIMARY KEY IDENTITY(1,1),\n[AUD_INSRT_TMSTP] DATETIME2 DEFAULT SYSDATETIME(),\n[AUD_INSRT_USER] NVARCHAR(200) DEFAULT stuff(suser_sname(), 1, charindex('\\', suser_sname()), '')\n);
""".format(Text_Block) +""

print(create_fullstr)

-----------Create Table Statement------------
DROP TABLE IF EXISTS [dbo].[green_tripdata_2020-04_Excel];

CREATE TABLE [dbo].[green_tripdata_2020-04_Excel](
[VENDORID] BIGINT,
[LPEP_PICKUP_DATETIME] DATETIME2,
[LPEP_DROPOFF_DATETIME] NVARCHAR(20),
[STORE_AND_FWD_FLAG] NVARCHAR(20),
[RATECODEID] BIGINT,
[PULOCATIONID] BIGINT,
[DOLOCATIONID] BIGINT,
[PASSENGER_COUNT] BIGINT,
[TRIP_DISTANCE] DECIMAL(18,2),
[FARE_AMOUNT] DECIMAL(18,2),
[EXTRA] DECIMAL(18,2),
[MTA_TAX] DECIMAL(18,2),
[TIP_AMOUNT] DECIMAL(18,2),
[TOLLS_AMOUNT] DECIMAL(18,2),
[EHAIL_FEE] DECIMAL(18,2),
[IMPROVEMENT_SURCHARGE] DECIMAL(18,2),
[TOTAL_AMOUNT] DECIMAL(18,2),
[PAYMENT_TYPE] NVARCHAR(20),
[TRIP_TYPE] NVARCHAR(20),
[CONGESTION_SURCHARGE] DECIMAL(18,2),
[AUD_DIRECTORY] NVARCHAR(20),
[AUD_FILENAME] NVARCHAR(50),
[AUD_SEQ_ID] BIGINT PRIMARY KEY IDENTITY(1,1),
[AUD_INSRT_TMSTP] DATETIME2 DEFAULT SYSDATETIME(),
[AUD_INSRT_USER] NVARCHAR(200) DEFAULT stuff(suser_sname(), 1, charindex('\', suser_sname()), '')
);



In [40]:
# Replace any column definitions that weren't properly generated by above steps, or couldn't be based on pandas limitations (like the out of range date column)
create_fullstr = create_fullstr.replace('[LPEP_DROPOFF_DATETIME] NVARCHAR(20),', '[LPEP_DROPOFF_DATETIME] DATETIME2,')
print(create_fullstr)

DROP TABLE IF EXISTS [dbo].[green_tripdata_2020-04_Excel];

CREATE TABLE [dbo].[green_tripdata_2020-04_Excel](
[VENDORID] BIGINT,
[LPEP_PICKUP_DATETIME] DATETIME2,
[LPEP_DROPOFF_DATETIME] DATETIME2,
[STORE_AND_FWD_FLAG] NVARCHAR(20),
[RATECODEID] BIGINT,
[PULOCATIONID] BIGINT,
[DOLOCATIONID] BIGINT,
[PASSENGER_COUNT] BIGINT,
[TRIP_DISTANCE] DECIMAL(18,2),
[FARE_AMOUNT] DECIMAL(18,2),
[EXTRA] DECIMAL(18,2),
[MTA_TAX] DECIMAL(18,2),
[TIP_AMOUNT] DECIMAL(18,2),
[TOLLS_AMOUNT] DECIMAL(18,2),
[EHAIL_FEE] DECIMAL(18,2),
[IMPROVEMENT_SURCHARGE] DECIMAL(18,2),
[TOTAL_AMOUNT] DECIMAL(18,2),
[PAYMENT_TYPE] NVARCHAR(20),
[TRIP_TYPE] NVARCHAR(20),
[CONGESTION_SURCHARGE] DECIMAL(18,2),
[AUD_DIRECTORY] NVARCHAR(20),
[AUD_FILENAME] NVARCHAR(50),
[AUD_SEQ_ID] BIGINT PRIMARY KEY IDENTITY(1,1),
[AUD_INSRT_TMSTP] DATETIME2 DEFAULT SYSDATETIME(),
[AUD_INSRT_USER] NVARCHAR(200) DEFAULT stuff(suser_sname(), 1, charindex('\', suser_sname()), '')
);
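`str.replace` returns the string unchanged when the search text is not found, so a typo in the old column definition would go unnoticed until the load fails or the column ends up with the wrong type. The sketch below wraps the replacement in a check that the old definition occurs exactly once. `replace_column_definition` and the two-column sample statement are hypothetical.

```python
def replace_column_definition(ddl: str, old: str, new: str) -> str:
    """Swap one column definition in a CREATE TABLE string, failing loudly on a mismatch."""
    occurrences = ddl.count(old)
    if occurrences != 1:
        raise ValueError(f"Expected exactly one match for {old!r}, found {occurrences}")
    return ddl.replace(old, new)


# Shortened stand-in for the generated CREATE TABLE text
sample_ddl = "[LPEP_DROPOFF_DATETIME] NVARCHAR(20),\n[RATECODEID] BIGINT"
print(replace_column_definition(
    sample_ddl,
    "[LPEP_DROPOFF_DATETIME] NVARCHAR(20),",
    "[LPEP_DROPOFF_DATETIME] DATETIME2,",
))
```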



In [41]:
# Change NaN, N/A, etc to "None" which SQL Server will interpret as NULL
# https://stackoverflow.com/questions/14162723/replacing-pandas-or-numpy-nan-with-a-none-to-use-with-mysqldb
# Select the statement that has the intended effect
#df = df.fillna('') # Best for when loading as all 'object' types to blank out NaN values
#df = df.replace('NaN',None) 
#df = df.where(df.notnull(),None) # May cause affected column types to be set to 'object'
df = df.replace({np.nan: None}) # May cause affected column types to be set to 'object'
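To illustrate what this step does to the rows that are later passed to the insert, the sketch below performs the same NaN-to-None conversion on a plain list of rows using only the standard library. It is an illustration of the effect, not a replacement for the pandas call; `nan_to_none` and the sample rows are hypothetical.

```python
import math


def nan_to_none(rows):
    """Return a copy of rows with every float NaN replaced by None."""
    return [
        [None if isinstance(value, float) and math.isnan(value) else value
         for value in row]
        for row in rows
    ]


# Made-up rows: one complete, one with a missing numeric value
sample_rows = [[2, 1.68, "N"], [2, float("nan"), "N"]]
print(nan_to_none(sample_rows))
```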

In [42]:
# Create Insert Statement
insert_str="INSERT INTO " + table_name + "(" + ",".join(Col_list) + ")" + " VALUES(" + ",".join(Param_list) + ")"
# Print for visual confirmation
print(insert_str)

INSERT INTO [dbo].[green_tripdata_2020-04_Excel]([VENDORID],[LPEP_PICKUP_DATETIME],[LPEP_DROPOFF_DATETIME],[STORE_AND_FWD_FLAG],[RATECODEID],[PULOCATIONID],[DOLOCATIONID],[PASSENGER_COUNT],[TRIP_DISTANCE],[FARE_AMOUNT],[EXTRA],[MTA_TAX],[TIP_AMOUNT],[TOLLS_AMOUNT],[EHAIL_FEE],[IMPROVEMENT_SURCHARGE],[TOTAL_AMOUNT],[PAYMENT_TYPE],[TRIP_TYPE],[CONGESTION_SURCHARGE],[AUD_DIRECTORY],[AUD_FILENAME]) VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)


In [43]:
# Set and confirm ending record count
rowcount2 = df.shape[0]
print('Started with {} and ended with {} records'.format(rowcount1, rowcount2))

# Set and confirm ending column count
# +2 columns are expected due to the AUD_DIRECTORY and AUD_FILENAME columns being added
colcount2 = df.shape[1]
print('Started with {} and ended with {} columns'.format(colcount1, colcount2))

Started with 35612 and ended with 35612 records
Started with 20 and ended with 22 columns


In [44]:
# Create/recreate the table

# Initialize new SQL Server cursor
mssql_cursor = mssql_conn.cursor()
# Execute the drop/create table statement
mssql_cursor.execute(create_fullstr)
# Commit change
mssql_conn.commit()
# Close cursor
mssql_cursor.close()

In [45]:
# Load the data via pandas
# This option is faster, but if there are issues with pandas data types loading into 
#   SQL Server data types use the csv_reader method below

# Initialize new SQL Server cursor
mssql_cursor = mssql_conn.cursor()
# Enable fast execute on the cursor (Greatly speeds up load)
mssql_cursor.fast_executemany = True
# Execute the insert statements for each record in the dataframe
mssql_cursor.executemany(insert_str,df.values.tolist())
# Commit the change
mssql_cursor.commit()

print(f'Data loaded on {dttm.now(timezone.utc)} UTC')

Data loaded on 2025-06-25 00:19:57.134157+00:00 UTC


In [46]:
## Load the data via csv_reader

# import csv

# mssql_cursor = mssql_conn.cursor()
# with open(fullPath, 'r', encoding='utf-8', newline='') as file:
#     # Create a CSV reader object
#     csv_reader = csv.reader(file)
    
#     # Skip the header row
#     next(csv_reader)

#     rows = []  # List to hold rows for bulk insert
#     contracts_file_row_count = 0
#     for row in csv_reader:
#             cleaned_row = [None if value == '' else value for value in row]
#             cleaned_row.extend([filePath, fileName])  # AUD_DIRECTORY and AUD_FILENAME, matching insert_str
#             rows.append(cleaned_row)
#             #rows.extend([cleaned_row, fileName])
#             #print(rows)
#             contracts_file_row_count += 1
#             #if contracts_file_row_count == 1:
#             #      break

                       
#     # insert rows into table
#     mssql_cursor.executemany(insert_str, rows)
#     mssql_conn.commit()

# print(f'Data loaded on {dttm.now(timezone.utc)} UTC')
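The row preparation in the commented csv_reader option can be tried without a database. The sketch below reads from an in-memory string instead of a file, converts empty strings to `None`, and appends both audit values to each row so the row width matches an insert statement that includes the AUD_DIRECTORY and AUD_FILENAME columns. `read_rows` and the sample data are hypothetical.

```python
import csv
import io


def read_rows(text_stream, directory, file_name):
    """Return (header, rows) with blanks as None and two audit values appended per row."""
    reader = csv.reader(text_stream)
    header = next(reader)  # skip the header row but keep it for reference
    rows = []
    for row in reader:
        cleaned = [None if value == "" else value for value in row]
        cleaned.extend([directory, file_name])
        rows.append(cleaned)
    return header, rows


# Two made-up data rows, each with one blank field
sample_csv = io.StringIO("VendorID,trip_distance\n2,\n,1.94\n")
header, rows = read_rows(sample_csv, "sampleData", "sample.csv")
print(header)
print(rows)
```

The resulting `rows` list has the shape that `executemany` expects: one list of parameter values per record.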

In [47]:
# Query top 10 records of SQL Server table to ensure data loaded correctly
# mssql_cursor = mssql_conn.cursor() # Uncomment this line if the cursor opened above has been closed
top_ten_query = f"select top 10 * from {table_name}"
confirmQuery = mssql_cursor.execute(top_ten_query).fetchall()
for row in confirmQuery:
    print('%r' % (row,))

(2, datetime.datetime(2020, 4, 1, 0, 44, 2), datetime.datetime(2020, 4, 1, 0, 52, 23), 'N', 1, 42, 41, 1, Decimal('1.68'), Decimal('8.00'), Decimal('0.50'), Decimal('0.50'), Decimal('0.00'), Decimal('0.00'), None, Decimal('0.30'), Decimal('9.30'), '1', '1', Decimal('0.00'), 'sampleData', 'green_tripdata_2020-04.xlsx', 1, datetime.datetime(2025, 6, 25, 0, 19, 48, 560905), 'WilshereJ')
(2, datetime.datetime(2020, 4, 1, 0, 24, 39), datetime.datetime(2020, 4, 1, 0, 33, 6), 'N', 1, 244, 247, 2, Decimal('1.94'), Decimal('9.00'), Decimal('0.50'), Decimal('0.50'), Decimal('0.00'), Decimal('0.00'), None, Decimal('0.30'), Decimal('10.30'), '2', '1', Decimal('0.00'), 'sampleData', 'green_tripdata_2020-04.xlsx', 2, datetime.datetime(2025, 6, 25, 0, 19, 48, 560905), 'WilshereJ')
(2, datetime.datetime(2020, 4, 1, 0, 45, 6), None, 'N', 1, 244, 243, 3, Decimal('1.00'), Decimal('6.50'), Decimal('0.50'), Decimal('0.50'), Decimal('0.00'), Decimal('0.00'), None, Decimal('0.30'), Decimal('7.80'), '2', '1',

In [48]:
# Query the top 10 records of the SQL Server table into a DataFrame to confirm, in a more readable format, that the data loaded correctly
with engine.begin() as conn:
    df_top_ten = pd.read_sql_query(sa.text(top_ten_query), conn)
df_top_ten

Unnamed: 0,VENDORID,LPEP_PICKUP_DATETIME,LPEP_DROPOFF_DATETIME,STORE_AND_FWD_FLAG,RATECODEID,PULOCATIONID,DOLOCATIONID,PASSENGER_COUNT,TRIP_DISTANCE,FARE_AMOUNT,EXTRA,MTA_TAX,TIP_AMOUNT,TOLLS_AMOUNT,EHAIL_FEE,IMPROVEMENT_SURCHARGE,TOTAL_AMOUNT,PAYMENT_TYPE,TRIP_TYPE,CONGESTION_SURCHARGE,AUD_DIRECTORY,AUD_FILENAME,AUD_SEQ_ID,AUD_INSRT_TMSTP,AUD_INSRT_USER
0,2,2020-04-01 00:44:02,2020-04-01 00:52:23,N,1,42,41,1,1.68,8.0,0.5,0.5,0.0,0.0,,0.3,9.3,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx,1,2025-06-25 00:19:48.560905,WilshereJ
1,2,2020-04-01 00:24:39,2020-04-01 00:33:06,N,1,244,247,2,1.94,9.0,0.5,0.5,0.0,0.0,,0.3,10.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx,2,2025-06-25 00:19:48.560905,WilshereJ
2,2,2020-04-01 00:45:06,NaT,N,1,244,243,3,1.0,6.5,0.5,0.5,0.0,0.0,,0.3,7.8,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx,3,2025-06-25 00:19:48.560905,WilshereJ
3,2,2020-04-01 00:45:06,2020-04-01 01:04:39,N,1,244,243,2,,12.0,0.5,0.5,0.0,0.0,,0.3,13.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx,4,2025-06-25 00:19:48.560905,WilshereJ
4,2,2020-04-01 00:00:23,2020-04-01 00:16:13,N,1,75,169,1,6.79,21.0,0.5,0.5,0.0,0.0,,0.3,22.3,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx,5,2025-06-25 00:19:48.560905,WilshereJ
5,2,2020-04-01 00:41:23,NaT,N,1,41,233,1,5.04,16.5,0.5,0.5,0.0,0.0,,0.3,20.55,1,1,2.75,sampleData,green_tripdata_2020-04.xlsx,6,2025-06-25 00:19:48.560905,WilshereJ
6,2,2020-04-01 00:00:52,2020-04-01 00:09:49,N,1,244,127,1,3.27,11.5,0.5,0.5,2.0,0.0,,0.3,14.8,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx,7,2025-06-25 00:19:48.560905,WilshereJ
7,2,2020-04-01 00:51:04,2020-04-01 00:57:25,N,1,244,42,1,1.48,7.0,0.5,0.5,0.0,0.0,,0.3,8.3,2,1,0.0,sampleData,green_tripdata_2020-04.xlsx,8,2025-06-25 00:19:48.560905,WilshereJ
8,2,2020-04-01 00:46:02,2020-04-01 01:05:52,N,4,244,265,1,14.41,42.5,0.5,0.5,,0.0,,0.3,46.8,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx,9,2025-06-25 00:19:48.560905,WilshereJ
9,1,2020-04-01 00:29:05,2020-04-01 00:46:09,N,1,25,71,1,3.6,14.5,0.5,0.5,4.2,0.0,,0.3,20.0,1,1,0.0,sampleData,green_tripdata_2020-04.xlsx,10,2025-06-25 00:19:48.560905,WilshereJ


In [49]:
# Confirm record counts
print(df.shape[0])

confirmSQLServerRowCount = mssql_cursor.execute(f"select count(*) from {table_name}").fetchall()
print(confirmSQLServerRowCount)

35612
[(35612,)]
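
The two counts above are compared by eye. A minimal sketch of turning that comparison into an automatic check follows; `assert_row_counts_match` is a hypothetical helper (not part of this notebook), and the inputs are the values printed above.

```python
def assert_row_counts_match(source_count, fetched_rows):
    """Compare a source row count with the rows returned by fetchall() on a COUNT(*) query."""
    # fetchall() returns a list of row tuples, e.g. [(35612,)]
    loaded_count = fetched_rows[0][0]
    if source_count != loaded_count:
        raise ValueError(
            f"Row count mismatch: source={source_count}, loaded={loaded_count}"
        )
    return loaded_count


# Values taken from the output above
print(assert_row_counts_match(35612, [(35612,)]))
```

In the notebook this could be called as `assert_row_counts_match(df.shape[0], confirmSQLServerRowCount)`, so a partial load stops the run instead of passing unnoticed.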


In [50]:
# Create a profile for the table
mssql_cursor.execute("EXEC meta.ProfileSingleTable @SCHEMA = ?, @TABLE = ?", schema_name, base_table_name)
mssql_conn.commit()

In [51]:
# Construct the query of the meta.TABLE_PROFILES table to get the profile of the table with data typed columns
typedProfileQuery = f"SELECT * FROM meta.TABLE_PROFILES WHERE SCHEMA_NAME = '{schema_name}' AND TABLE_NAME = '{base_table_name}'"
print(typedProfileQuery)

SELECT * FROM meta.TABLE_PROFILES WHERE SCHEMA_NAME = 'dbo' AND TABLE_NAME = 'green_tripdata_2020-04_Excel'
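
The query above interpolates the schema and table names directly into the SQL string. A possible alternative is to pass them as bound parameters, so that a name containing a quote cannot change the statement. The sketch below only illustrates the idea: it uses an in-memory SQLite database as a stand-in for SQL Server, and the simplified `TABLE_PROFILES` table and the `fetch_profile` helper are assumptions, not objects from this project.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server so the sketch runs anywhere
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TABLE_PROFILES (SCHEMA_NAME TEXT, TABLE_NAME TEXT, COLUMN_NAME TEXT)"
)
conn.executemany(
    "INSERT INTO TABLE_PROFILES VALUES (?, ?, ?)",
    [
        ("dbo", "green_tripdata_2020-04_Excel", "[VENDORID]"),
        ("dbo", "green_tripdata_2020-04_Excel_AllChar", "[VENDORID]"),
    ],
)


def fetch_profile(connection, schema_name, table_name):
    """Select profile rows using '?' placeholders instead of string interpolation."""
    query = "SELECT * FROM TABLE_PROFILES WHERE SCHEMA_NAME = ? AND TABLE_NAME = ?"
    return connection.execute(query, (schema_name, table_name)).fetchall()


rows = fetch_profile(conn, "dbo", "green_tripdata_2020-04_Excel")
# A value containing quotes is treated as data, so it simply matches nothing
injected = fetch_profile(conn, "dbo", "x' OR '1'='1")
print(rows)
print(injected)
conn.close()
```

pyodbc uses the same `?` placeholder style. With `sa.text`, the equivalent would be named binds such as `:schema_name`, passed through the `params` argument of `pd.read_sql_query`.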


In [52]:
# Read in and display the table profile of the data typed table
with engine.begin() as conn:
    df_typedprofile = pd.read_sql_query(sa.text(typedProfileQuery), conn)
df_typedprofile

Unnamed: 0,SCHEMA_NAME,TABLE_NAME,COLUMN_ID,COLUMN_NAME,DATATYPE,MAX_LEN,MAX_LEN_VALUE,MIN_LEN,NULL_COUNT,BLANK_COUNT,NUMERIC_VALUE_COUNT,MAX_PRECISION,MAX_SCALE,DISTINCT_COUNT,TOTAL_RECORD_COUNT
0,dbo,green_tripdata_2020-04_Excel,1,[VENDORID],BIGINT,1,2,1,11110,0,,,,2,35612
1,dbo,green_tripdata_2020-04_Excel,2,[LPEP_PICKUP_DATETIME],DATETIME2,27,2020-04-01 00:44:02.0000000,27,0,0,,,,32973,35612
2,dbo,green_tripdata_2020-04_Excel,3,[LPEP_DROPOFF_DATETIME],DATETIME2,27,2020-04-01 00:52:23.0000000,27,2,0,,,,32981,35612
3,dbo,green_tripdata_2020-04_Excel,4,[STORE_AND_FWD_FLAG],NVARCHAR(20),1,N,1,11110,0,0.0,,,2,35612
4,dbo,green_tripdata_2020-04_Excel,5,[RATECODEID],BIGINT,1,1,1,11110,0,,,,6,35612
5,dbo,green_tripdata_2020-04_Excel,6,[PULOCATIONID],BIGINT,3,244,1,0,0,,,,235,35612
6,dbo,green_tripdata_2020-04_Excel,7,[DOLOCATIONID],BIGINT,3,247,1,0,0,,,,249,35612
7,dbo,green_tripdata_2020-04_Excel,8,[PASSENGER_COUNT],BIGINT,1,1,1,11110,0,,,,9,35612
8,dbo,green_tripdata_2020-04_Excel,9,[TRIP_DISTANCE],"DECIMAL(18,2)",8,26471.58,4,1,0,,,,2568,35612
9,dbo,green_tripdata_2020-04_Excel,10,[FARE_AMOUNT],"DECIMAL(18,2)",6,314.50,4,0,0,,,,4043,35612
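
To make the profile columns above more concrete, here is a rough pure-Python sketch of how metrics such as MAX_LEN, NULL_COUNT, BLANK_COUNT and DISTINCT_COUNT could be computed for a single column. It is not the implementation of meta.ProfileSingleTable, which is not shown here; `profile_column` and the sample values are hypothetical.

```python
def profile_column(values):
    """Compute a few profile metrics for one column's values (None represents NULL)."""
    non_null = [v for v in values if v is not None]
    # Lengths are measured on the string form, so a blank ('') has length 0
    lengths = [len(str(v)) for v in non_null]
    return {
        "MAX_LEN": max(lengths, default=None),
        "MIN_LEN": min(lengths, default=None),
        "NULL_COUNT": len(values) - len(non_null),
        "BLANK_COUNT": sum(1 for v in non_null if v == ""),
        "DISTINCT_COUNT": len(set(non_null)),  # NULLs are excluded, blanks are not
        "TOTAL_RECORD_COUNT": len(values),
    }


# Hypothetical sample values, not taken from the loaded file
sample = ["N", "Y", "", None, "N"]
print(profile_column(sample))
```

In this sketch a blank contributes a length of 0 and counts as a distinct value, while a NULL contributes to neither.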


In [53]:
# Read in the stored procedure script to compare the two profiles
with open('../Data Profiling/Create Stored Procedure meta.CompareCharAndTypedTableProfiles.sql', 'rb') as f:
    create_compare_profile_sp = f.read().decode('utf-8')

In [54]:
# Create or alter the meta.CompareCharAndTypedTableProfiles stored procedure
mssql_cursor.execute(create_compare_profile_sp)
mssql_conn.commit()

In [55]:
compare_query = f"EXEC meta.CompareCharAndTypedTableProfiles @ALLCHAR_SCHEMA_NAME = '{schema_name}', @ALLCHAR_TABLE_NAME = '{allchar_table_name}', @TYPED_SCHEMA_NAME = '{schema_name}', @TYPED_TABLE_NAME = '{base_table_name}'"
print(compare_query)

EXEC meta.CompareCharAndTypedTableProfiles @ALLCHAR_SCHEMA_NAME = 'dbo', @ALLCHAR_TABLE_NAME = 'green_tripdata_2020-04_Excel_AllChar', @TYPED_SCHEMA_NAME = 'dbo', @TYPED_TABLE_NAME = 'green_tripdata_2020-04_Excel'


In [None]:
# Read in and display the table profile comparison results

# !!!!Important Note!!!! - The [char_DISTINCT_COUNT] will be 1 higher than the [type_DISTINCT_COUNT] if the column had blanks in the _AllChar table
# that were converted to NULL in the data typed table, because '' is counted as a value in a DISTINCT count but NULL is not.

with engine.begin() as conn:
    df_profilecompare = pd.read_sql_query(sa.text(compare_query), conn)
df_profilecompare

Unnamed: 0,char_COLUMN_NAME,type_COLUMN_NAME,char_DATATYPE,type_DATATYPE,char_MAX_LEN,type_MAX_LEN,char_MAX_LEN_VALUE,type_MAX_LEN_VALUE,char_MIN_LEN,type_MIN_LEN,char_NULL_COUNT,type_NULL_COUNT,char_BLANK_COUNT,type_BLANK_COUNT,char_NUMERIC_VALUE_COUNT,type_NUMERIC_VALUE_COUNT,char_MAX_PRECISION,type_MAX_PRECISION,char_MAX_SCALE,type_MAX_SCALE,char_DISTINCT_COUNT,type_DISTINCT_COUNT,char_TOTAL_RECORD_COUNT,type_TOTAL_RECORD_COUNT
0,[VENDORID],[VENDORID],NVARCHAR(20),BIGINT,1,1,2,2,0,1,0,11110,11110,0,24502.0,,,,,,3,2,35612,35612
1,[LPEP_PICKUP_DATETIME],[LPEP_PICKUP_DATETIME],NVARCHAR(20),DATETIME2,19,27,2020-04-01 00:44:02,2020-04-01 00:44:02.0000000,19,27,0,0,0,0,0.0,,,,,,32973,32973,35612,35612
2,[LPEP_DROPOFF_DATETIME],[LPEP_DROPOFF_DATETIME],NVARCHAR(20),DATETIME2,19,27,2020-04-01 00:52:23,2020-04-01 00:52:23.0000000,0,27,0,2,2,0,0.0,,,,,,32982,32981,35612,35612
3,[STORE_AND_FWD_FLAG],[STORE_AND_FWD_FLAG],NVARCHAR(20),NVARCHAR(20),1,1,N,N,0,1,0,11110,11110,0,0.0,0.0,,,,,3,2,35612,35612
4,[RATECODEID],[RATECODEID],NVARCHAR(20),BIGINT,1,1,1,1,0,1,0,11110,11110,0,24502.0,,,,,,7,6,35612,35612
5,[PULOCATIONID],[PULOCATIONID],NVARCHAR(20),BIGINT,3,3,244,244,1,1,0,0,0,0,35612.0,,,,,,235,235,35612,35612
6,[DOLOCATIONID],[DOLOCATIONID],NVARCHAR(20),BIGINT,3,3,247,247,1,1,0,0,0,0,35612.0,,,,,,249,249,35612,35612
7,[PASSENGER_COUNT],[PASSENGER_COUNT],NVARCHAR(20),BIGINT,1,1,1,1,0,1,0,11110,11110,0,24502.0,,,,,,10,9,35612,35612
8,[TRIP_DISTANCE],[TRIP_DISTANCE],NVARCHAR(20),"DECIMAL(18,2)",8,8,26471.58,26471.58,0,4,0,1,1,0,35611.0,,7.0,,2.0,,2569,2568,35612,35612
9,[FARE_AMOUNT],[FARE_AMOUNT],NVARCHAR(20),"DECIMAL(18,2)",6,6,-24.46,314.50,1,4,0,0,0,0,35612.0,,4.0,,2.0,,4043,4043,35612,35612
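
The note in the cell above suggests a mechanical check: if every blank in the _AllChar table became NULL in the typed table, the typed NULL count should equal the char NULL count plus the char blank count, and the distinct count should drop by exactly one. The sketch below applies that check to three rows copied from the comparison output. `blanks_explain_differences` is a hypothetical helper, and it assumes no two distinct strings collapse into the same typed value.

```python
def blanks_explain_differences(row):
    """Return True if blank-to-NULL conversion accounts for the NULL and distinct count gaps."""
    blanks_converted = row["char_BLANK_COUNT"] - row["type_BLANK_COUNT"]
    nulls_ok = row["type_NULL_COUNT"] == row["char_NULL_COUNT"] + blanks_converted
    # '' counts as one distinct value in the _AllChar table; NULL counts as none
    distinct_drop = 1 if blanks_converted > 0 and row["type_BLANK_COUNT"] == 0 else 0
    distinct_ok = row["type_DISTINCT_COUNT"] == row["char_DISTINCT_COUNT"] - distinct_drop
    return bool(nulls_ok and distinct_ok)


# Three rows copied from the comparison output above
rows = [
    {"column": "[VENDORID]", "char_NULL_COUNT": 0, "type_NULL_COUNT": 11110,
     "char_BLANK_COUNT": 11110, "type_BLANK_COUNT": 0,
     "char_DISTINCT_COUNT": 3, "type_DISTINCT_COUNT": 2},
    {"column": "[LPEP_DROPOFF_DATETIME]", "char_NULL_COUNT": 0, "type_NULL_COUNT": 2,
     "char_BLANK_COUNT": 2, "type_BLANK_COUNT": 0,
     "char_DISTINCT_COUNT": 32982, "type_DISTINCT_COUNT": 32981},
    {"column": "[PULOCATIONID]", "char_NULL_COUNT": 0, "type_NULL_COUNT": 0,
     "char_BLANK_COUNT": 0, "type_BLANK_COUNT": 0,
     "char_DISTINCT_COUNT": 235, "type_DISTINCT_COUNT": 235},
]

results = {r["column"]: blanks_explain_differences(r) for r in rows}
print(results)
```

A column that returns False here is worth a closer look, since the gap between the two profiles is then not explained by blanks alone (for example, values that failed type conversion, or strings such as '1.0' and '1.00' that map to the same number).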


In [57]:
# Close any open cursors and connections
try:
    mssql_cursor.close()
except Exception:
    # pyodbc raises an error if the cursor was already closed
    print('SQL Server cursor already closed')
try:
    mssql_conn.close()
except Exception:
    # pyodbc raises an error if the connection was already closed
    print('SQL Server connection already closed')
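
As an alternative to closing by hand at the end of the notebook, `contextlib.closing` can guarantee that `close()` is called even if a statement fails partway through. The sketch below uses an in-memory SQLite connection as a stand-in for the pyodbc connection so that it runs anywhere. The same pattern should apply to any object that exposes a `close()` method, which pyodbc cursors and connections do.

```python
import sqlite3
from contextlib import closing

# closing() calls .close() on exit from the with block, whether or not an error occurred
with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cursor:
        result = cursor.execute("SELECT 1").fetchone()[0]

print(result)

# The connection is closed at this point, so further use raises an error
try:
    conn.execute("SELECT 1")
    connection_closed = False
except sqlite3.ProgrammingError:
    connection_closed = True
print(connection_closed)
```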