# Data type sizes - a western myth

## Test Latin character strings with Latin collation

**Note:** My server default is SQL_Latin1_General_CP1_CI_AS

Set the size limit of both data types to cover the same worst case for Basic Multilingual Plane (BMP) characters, which range from 1 byte (ASCII) to 3 bytes (East Asian) per character in UTF-8. That gives a maximum of 24 bytes for an 8-character East Asian string.
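
As a quick cross-check outside SQL Server, the sizing rule can be sketched in Python. The language and the three sample characters are assumptions of this example, not part of the notebook, and UTF-8 is assumed as the encoding being sized for.

```python
# Cross-check of the sizing rule, assuming UTF-8 as the target encoding.
# The sample characters are illustrative picks from the BMP.
samples = {"A": "ASCII", "é": "Latin-1 Supplement", "敏": "CJK (BMP)"}

# Bytes needed per character under UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, label in samples.items():
    print(f"{label}: {widths[ch]} byte(s)")

# Worst case for an 8-character BMP string: widest character times 8.
worst_case = 8 * max(widths.values())
print("worst case for 8 characters:", worst_case, "bytes")
```

The widest BMP character drives the varchar length, which is where the 24 in varchar(24) comes from.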

In [15]:
USE UnicodeDatabase
GO
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (c1 varchar(24) COLLATE Latin1_General_100_CI_AI, 
	c2 nvarchar(8) COLLATE Latin1_General_100_CI_AI);  
INSERT INTO t1 VALUES (N'MyString', N'MyString')  
SELECT LEN(c1) AS [varchar LEN],  
	DATALENGTH(c1) AS [varchar DATALENGTH], c1
FROM t1;  
SELECT LEN(c2) AS [nvarchar LEN], 
	DATALENGTH(c2) AS [nvarchar DATALENGTH], c2 
FROM t1;
GO

varchar LEN,varchar DATALENGTH,c1
8,8,MyString


nvarchar LEN,nvarchar DATALENGTH,c2
8,16,MyString


That's as expected in both cases. So what was I talking about?

Run the next example with Chinese characters.

## Test Chinese character strings with Latin collation

In [16]:
USE UnicodeDatabase
GO
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (c1 varchar(24) COLLATE Latin1_General_100_CI_AI, 
	c2 nvarchar(8) COLLATE Latin1_General_100_CI_AI);  
INSERT INTO t1 VALUES (N'敏捷的棕色狐狸跳', N'敏捷的棕色狐狸跳')  
SELECT LEN(c1) AS [varchar LEN],  
	DATALENGTH(c1) AS [varchar DATALENGTH], c1
FROM t1;  
SELECT LEN(c2) AS [nvarchar LEN], 
	DATALENGTH(c2) AS [nvarchar DATALENGTH], c2 
FROM t1;
GO

varchar LEN,varchar DATALENGTH,c1
8,8,????????


nvarchar LEN,nvarchar DATALENGTH,c2
8,16,敏捷的棕色狐狸跳


Uh-oh, data loss in the varchar example. Why?

By default, varchar is bound to the code page of its collation, and these characters cannot be found in the Latin code page.

But why didn't it happen in the nvarchar example?

These Chinese characters are double-byte and within the *Basic Multilingual Plane* (BMP), and nvarchar with this non-SC collation encodes in UCS-2 (BMP), not in the code page.
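
The same contrast can be reproduced outside SQL Server. This is a minimal Python sketch, assuming that the Latin collation maps to Windows code page 1252 and that UTF-16 little-endian without a byte order mark is a fair stand-in for nvarchar storage.

```python
text = "敏捷的棕色狐狸跳"

# Assumption: code page 1252 stands in for the Latin collation's code page.
# It has no mapping for these characters, so a strict encode fails.
try:
    text.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# UTF-16 little-endian without a BOM: 2 bytes per BMP character.
utf16_bytes = len(text.encode("utf-16-le"))

print("fits in cp1252:", fits_cp1252)
print("UTF-16 bytes:", utf16_bytes)
```

Python raises an error where the varchar insert silently substituted characters, but the underlying cause is the same: the code page has no slot for these code points.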

Run the next example:

In [4]:
USE UnicodeDatabase
GO
SELECT ASCII('敏' COLLATE Latin1_General_100_CI_AI), CHAR(63);
SELECT ASCII('捷' COLLATE Latin1_General_100_CI_AI), CHAR(63);

(No column name),(No column name).1
63,?


(No column name),(No column name).1
63,?


The ASCII function returns the ASCII code value of the leftmost character of a character expression. The Latin code page has no mapping for these Chinese characters, so when the literal is converted to that code page, each unmappable character is replaced with a fallback question mark, which is code point 63. The CHAR function confirms that code point 63 is the question mark character.
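
A hedged Python analogue of that substitution: encoding with a replacement fallback is similar in spirit to the conversion above, though it is not SQL Server's own conversion routine. Code page 1252 is again an assumption standing in for the Latin collation.

```python
# Encode one Chinese character into code page 1252 with a '?' fallback.
encoded = "敏".encode("cp1252", errors="replace")

# The single resulting byte is the question mark, code point 63.
code = encoded[0]
print(encoded, code, chr(code))
```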

Run the next example:

In [5]:
USE UnicodeDatabase
GO
SELECT UNICODE(N'敏' COLLATE Latin1_General_100_CI_AI), NCHAR(25935);
SELECT UNICODE(N'捷' COLLATE Latin1_General_100_CI_AI), NCHAR(25463);

(No column name),(No column name).1
25935,敏


(No column name),(No column name).1
25463,捷


This works irrespective of collation. Adding the N prefix forces the use of a [Unicode constant](https://docs.microsoft.com/sql/t-sql/data-types/constants-transact-sql#unicode-strings), and for Unicode data the collation only sets the linguistic rules (comparison and sorting, case and accent sensitivity), not the encoding. The UNICODE function correctly identifies the code point of each character, and the NCHAR function turns that code point back into the character.
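
For comparison, Python's built-in ord and chr play roughly the roles of UNICODE and NCHAR. This sketch only confirms the code points shown in the output above.

```python
# ord() returns the Unicode code point; chr() maps it back to the character.
code_points = [ord(ch) for ch in "敏捷"]

for ch, cp in zip("敏捷", code_points):
    # Round trip: the code point identifies the character exactly.
    assert chr(cp) == ch
    print(ch, cp, f"U+{cp:04X}")
```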

## Now test Chinese character strings with Chinese collation

In [17]:
USE UnicodeDatabase
GO
DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (c1 varchar(24) COLLATE Chinese_Traditional_Stroke_Order_100_CI_AI, 
	c2 nvarchar(8) COLLATE Chinese_Traditional_Stroke_Order_100_CI_AI);  
INSERT INTO t2 VALUES (N'敏捷的棕色狐狸跳', N'敏捷的棕色狐狸跳')  
SELECT LEN(c1) AS [varchar LEN],  
	DATALENGTH(c1) AS [varchar DATALENGTH], c1
FROM t2;  
SELECT LEN(c2) AS [nvarchar LEN], 
	DATALENGTH(c2) AS [nvarchar DATALENGTH], c2 
FROM t2;
GO

varchar LEN,varchar DATALENGTH,c1
8,16,敏捷的棕色狐狸跳


nvarchar LEN,nvarchar DATALENGTH,c2
8,16,敏捷的棕色狐狸跳


Now the varchar example is correct because the code page can represent Chinese characters. But that's 2 bytes per character, not 3?

**Myth buster:** the code page defines the storage size of a varchar string. Varchar is **not** always 1 byte per character.

OK, but weren't East Asian characters 3 bytes? Yes, with UTF-8. Under the code page of the Chinese collation (code page 950), they are encoded using 2 bytes, just like UCS-2/UTF-16.
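
The byte counts can be cross-checked with a short Python sketch. It assumes Python's cp950 codec is a reasonable stand-in for the code page behind the Chinese Traditional collation.

```python
text = "敏捷的棕色狐狸跳"

# Byte length of the same 8 characters under three encodings.
# cp950 is an assumed stand-in for the Chinese Traditional code page.
sizes = {enc: len(text.encode(enc)) for enc in ("cp950", "utf-16-le", "utf-8")}

for enc, size in sizes.items():
    print(f"{enc}: {size} bytes, {size // len(text)} per character")
```

The double-byte code page and UTF-16 agree on 2 bytes per character, while UTF-8 needs 3 for the same text.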


## Test with Supplementary Characters (4 bytes)

In [20]:
USE UnicodeDatabase
GO
DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (c1 varchar(24) COLLATE Chinese_Traditional_Stroke_Order_100_CI_AI_SC, 
	c2 nvarchar(8) COLLATE Chinese_Traditional_Stroke_Order_100_CI_AI_SC);  
INSERT INTO t2 VALUES (N'👶👦👧👨👩👴👵👨', N'👶👦👧👨👩👴👵👨')  
SELECT LEN(c1) AS [varchar LEN],  
	DATALENGTH(c1) AS [varchar DATALENGTH], c1
FROM t2;  
SELECT LEN(c2) AS [nvarchar LEN], 
	DATALENGTH(c2) AS [nvarchar DATALENGTH], c2 
FROM t2;
GO

: Msg 2628, Level 16, State 1, Line 4
String or binary data would be truncated in table 'master.dbo.t2', column 'c2'. Truncated value: '👶👦👧👨'.

varchar LEN,varchar DATALENGTH,c1


nvarchar LEN,nvarchar DATALENGTH,c2


Uh-oh. Each supplementary character takes two byte-pairs (a surrogate pair) in UTF-16, so let's raise the nvarchar length from 8 to 16 byte-pairs (a 32-byte encoding limit).
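
The byte-pair arithmetic can be sketched in Python, assuming UTF-16 little-endian without a byte order mark as a stand-in for nvarchar storage.

```python
emoji = "👶👦👧👨👩👴👵👨"

# Python strings count code points, so this is the character count.
characters = len(emoji)

# Each supplementary character becomes a surrogate pair in UTF-16.
utf16 = emoji.encode("utf-16-le")
code_units = len(utf16) // 2  # 2-byte units, i.e. byte-pairs

print(characters, "characters,", code_units, "byte-pairs,", len(utf16), "bytes")
```

The nvarchar length is counted in byte-pairs, not characters, which is why 8 is too small for these 8 characters.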

In [21]:
USE UnicodeDatabase
GO
DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (c1 varchar(24) COLLATE Chinese_Traditional_Stroke_Order_100_CI_AI_SC, 
	c2 nvarchar(16) COLLATE Chinese_Traditional_Stroke_Order_100_CI_AI_SC);  
INSERT INTO t2 VALUES (N'👶👦👧👨👩👴👵👨', N'👶👦👧👨👩👴👵👨')  
SELECT LEN(c1) AS [varchar LEN],  
	DATALENGTH(c1) AS [varchar DATALENGTH], c1
FROM t2;  
SELECT LEN(c2) AS [nvarchar LEN], 
	DATALENGTH(c2) AS [nvarchar DATALENGTH], c2 
FROM t2;
GO

varchar LEN,varchar DATALENGTH,c1
16,16,????????????????


nvarchar LEN,nvarchar DATALENGTH,c2
8,32,👶👦👧👨👩👴👵👨


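Nvarchar looks good. But varchar still doesn't encode?

The Chinese code page has no mapping for supplementary characters, so the varchar column needs a UTF-8 enabled collation (note the UTF8 suffix on the collation name). It also needs a larger length, because 8 characters at 4 bytes each take 32 bytes. For example, double it from 24 to 48 bytes. Now try again: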

In [3]:
USE UnicodeDatabase
GO
DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (c1 varchar(48) COLLATE Chinese_Traditional_Stroke_Order_100_CI_AI_SC_UTF8, 
	c2 nvarchar(16) COLLATE Chinese_Traditional_Stroke_Order_100_CI_AI_SC);  
INSERT INTO t2 VALUES (N'👶👦👧👨👩👴👵👨', N'👶👦👧👨👩👴👵👨')  
SELECT LEN(c1) AS [varchar LEN],  
	DATALENGTH(c1) AS [varchar DATALENGTH], c1
FROM t2;  
SELECT LEN(c2) AS [nvarchar LEN], 
	DATALENGTH(c2) AS [nvarchar DATALENGTH], c2 
FROM t2;
GO

varchar LEN,varchar DATALENGTH,c1
8,32,👶👦👧👨👩👴👵👨


nvarchar LEN,nvarchar DATALENGTH,c2
8,32,👶👦👧👨👩👴👵👨


Finally!
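
The 32 bytes reported for the varchar column match a plain UTF-8 encoding of the string. A small Python check, offered as an illustration rather than a description of SQL Server internals:

```python
emoji = "👶👦👧👨👩👴👵👨"

# UTF-8 uses 4 bytes for every supplementary character.
utf8_bytes = len(emoji.encode("utf-8"))

# Compare against the two varchar lengths tried above.
fits_24 = utf8_bytes <= 24
fits_48 = utf8_bytes <= 48

print(utf8_bytes, "bytes; fits in 24:", fits_24, "; fits in 48:", fits_48)
```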

What if I needed all these characters in one database? Easy, I could just use nvarchar, which encodes in UTF-16.

In [23]:
USE UnicodeDatabase
GO
DROP TABLE IF EXISTS t3;
CREATE TABLE t3 (c1 nvarchar(110) COLLATE Latin1_General_100_CI_AI_SC);  
INSERT INTO t3 VALUES (N'MyStringThequickbrownfoxjumpsoverthelazydogIsLatinAscii敏捷的棕色狐狸跳👶👦')  
SELECT LEN(c1) AS [nvarchar UTF16 LEN],  
	DATALENGTH(c1) AS [nvarchar UTF16 DATALENGTH], c1
FROM t3; 
GO

nvarchar UTF16 LEN,nvarchar UTF16 DATALENGTH,c1
65,134,MyStringThequickbrownfoxjumpsoverthelazydogIsLatinAscii敏捷的棕色狐狸跳👶👦


But wait. The majority of my data is Latin (ASCII). Can we do better?

In [1]:
USE UnicodeDatabase
GO
DROP TABLE IF EXISTS t4;
CREATE TABLE t4 (c1 varchar(110) COLLATE Latin1_General_100_CI_AI_SC_UTF8);  
INSERT INTO t4 VALUES (N'MyStringThequickbrownfoxjumpsoverthelazydogIsLatinAscii敏捷的棕色狐狸跳👶👦')  
SELECT LEN(c1) AS [varchar UTF8 LEN],  
	DATALENGTH(c1) AS [varchar UTF8 DATALENGTH], c1
FROM t4; 
GO

varchar UTF8 LEN,varchar UTF8 DATALENGTH,c1
65,87,MyStringThequickbrownfoxjumpsoverthelazydogIsLatinAscii敏捷的棕色狐狸跳👶👦


With this data pattern the savings are obvious: 87 bytes with UTF-8 versus 134 bytes with UTF-16. Where do the savings come from? Let's break the string down into its Latin, Chinese, and emoji parts and compare.

In [2]:
USE UnicodeDatabase
GO
SELECT DATALENGTH(N'MyStringThequickbrownfoxjumpsoverthelazydogIsLatinAscii') AS [Latin_UTF16_2bytes], 
	DATALENGTH(N'敏捷的棕色狐狸跳') AS [Chinese_UTF16_2bytes], 
	DATALENGTH(N'👶👦') AS [SC_UTF16_4bytes]
SELECT DATALENGTH('MyStringThequickbrownfoxjumpsoverthelazydogIsLatinAscii' COLLATE Latin1_General_100_CI_AI_SC_UTF8) AS [Latin_UTF8_1byte], 
	DATALENGTH('敏捷的棕色狐狸跳' COLLATE Latin1_General_100_CI_AI_SC_UTF8) AS [Chinese_UTF8_3bytes], 
	DATALENGTH('👶👦' COLLATE Latin1_General_100_CI_AI_SC_UTF8) AS [SC_UTF8_4bytes]
GO

Latin_UTF16_2bytes,Chinese_UTF16_2bytes,SC_UTF16_4bytes
110,16,8


Latin_UTF8_1byte,Chinese_UTF8_3bytes,SC_UTF8_4bytes
55,24,8
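
The same breakdown can be reproduced in Python as a sanity check on the numbers. The dictionary keys and variable names are this sketch's own, and UTF-16 little-endian without a byte order mark is assumed for the nvarchar side.

```python
parts = {
    "Latin": "MyStringThequickbrownfoxjumpsoverthelazydogIsLatinAscii",
    "Chinese": "敏捷的棕色狐狸跳",
    "Emoji": "👶👦",
}

# Byte length of each part under both encodings.
utf16 = {name: len(s.encode("utf-16-le")) for name, s in parts.items()}
utf8 = {name: len(s.encode("utf-8")) for name, s in parts.items()}

for name in parts:
    print(f"{name}: UTF-16 {utf16[name]} bytes, UTF-8 {utf8[name]} bytes")

total16 = sum(utf16.values())
total8 = sum(utf8.values())
saving = 1 - total8 / total16
print(f"Total: UTF-16 {total16}, UTF-8 {total8}, saving {saving:.0%}")
```

The Latin part halves under UTF-8, the Chinese part grows by half, and the emoji part is unchanged, so the overall result depends on how much of the data is ASCII.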
