<h2>Data Transformation</h2>
 <ul>
    <li>Introduction</li>
    <li>Data Preprocessing Framework</li>
    <li>String Manipulation Functions</li>
    <li>Data Transformation Functions</li>
 </ul>

<h3>1. Introduction</h3>

In [5]:
import sqlite3
import pandas as pd

conn = sqlite3.connect('../chinook.db')
sql_query = 'SELECT * FROM albums'

df = pd.read_sql_query(sql_query, conn)
conn.close()  # close the connection once the data has been loaded

df.head()

Unnamed: 0,AlbumId,Title,ArtistId
0,1,For Those About To Rock We Salute You,1
1,2,Balls to the Wall,2
2,3,Restless and Wild,2
3,4,Let There Be Rock,1
4,5,Big Ones,3


In [9]:
df.to_csv('my_datasets.csv', index=False)  # save the DataFrame as a CSV file
df.to_excel('my_datasets.xlsx', index=False)  # save the DataFrame as an Excel workbook

<h3 style="text-align:center">Data Processing Context</h3>
<img src="../media/data_preprocessing.png" width="700px"/>

<h3>2. Data Preprocessing Framework</h3>

<img src="../media/Data_Imputation_Framework.png" width="800px"/>

<h3>3. String Manipulation Functions</h3>

In [None]:
%load_ext sql

In [12]:
%%sql
sqlite:///../Students.db

In [11]:
%%sql
SELECT name FROM sqlite_master WHERE type='table'

 * sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


name
Students
sqlite_sequence


In [16]:
%%sql
SELECT * FROM Students 
LIMIT 10

 * sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


AdmissionNo,Name,Surname,IDNumber
1,Jan,Makhanya,#820410-5405-084#
2,Dumisani,Morris,9005272774082
3,Christopher,Bennett,9011245483180
4,Marco,barnes,9902225381086
5,marthinus,Lourens,8105294344187
6,Patience,Banda,5911252957188
7,Tony,Ngwenya,5006191871185
8,gugulethu,Horn,#501004-621-2182#
9,Tumelo,Ebrahim,#751010-414-4187#
10,Priscilla,Jansen,6812103283181


<h4>String Manipulation functions of Interest</h4>
<ul>
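    <li>LENGTH(str): Return the number of characters in a string</li>
    <li>REPLACE(str, pattern, replacement): Replace every occurrence of pattern in str with replacement</li>
    <li>TRIM(str): Remove leading and trailing spaces</li>
    <li>LTRIM(str): Remove leading spaces</li>
    <li>RTRIM(str): Remove trailing spaces</li>
    <li>SUBSTR(str, start_index, num_chars): Return num_chars characters starting at start_index (the first character is at index 1)</li>
    <li>INSTR(str, substring): Return the 1-based index of the first occurrence of substring, or 0 if it is not found</li>
    <li>UPPER(str): Convert to uppercase</li>
    <li>LOWER(str): Convert to lowercase</li>
    <li>||: Concatenate strings</li>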
</ul>

<b>1. LENGTH</b>

In [14]:
%%sql

SELECT
    Name,
    LENGTH(Name) AS NameLength,
    IDNumber,
    LENGTH(IDNumber) AS IDLength
FROM
    Students
LIMIT 5;

 * sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,NameLength,IDNumber,IDLength
Jan,5,#820410-5405-084#,17
Dumisani,8,9005272774082,13
Christopher,16,9011245483180,13
Marco,19,9902225381086,13
marthinus,9,8105294344187,13
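
LENGTH counts every character, including spaces, which is one plausible reason a short name can report a large NameLength. The sketch below uses an in-memory database with made-up padded names (the table contents are assumptions, not the real Students data) and compares the raw length with the length after TRIM.

```python
import sqlite3

# Hypothetical rows: the padding is invented to illustrate the effect.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (Name TEXT)")
conn.executemany(
    "INSERT INTO Students (Name) VALUES (?)",
    [("  Jan",), ("Dumisani",), ("Marco   ",)],
)

# LENGTH includes leading and trailing spaces; TRIM removes them first.
length_rows = conn.execute(
    "SELECT Name, LENGTH(Name), LENGTH(TRIM(Name)) FROM Students ORDER BY rowid"
).fetchall()
conn.close()

for name, raw_length, trimmed_length in length_rows:
    print(repr(name), raw_length, trimmed_length)
```

A difference between the two lengths is a quick way to flag values that need trimming.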


<b>2. REPLACE </b>

In [21]:
%%sql

SELECT
    IDNumber,
    LENGTH(IDNumber) AS LengthOfID,
    REPLACE(IDNumber,'-','') AS IDNumberPreprocessed1, 
    REPLACE(REPLACE(IDNumber,'-',''),'#','') AS CorrectIDFormat, -- strip '-' and '#' so that only the 13 digits remain
    LENGTH(REPLACE(REPLACE(IDNumber,'-',''),'#','')) AS LengthOfCorrectID
FROM
    Students
WHERE
    LengthOfID<>13

 * sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


IDNumber,LengthOfID,IDNumberPreprocessed1,CorrectIDFormat,LengthOfCorrectID
#820410-5405-084#,17,#8204105405084#,8204105405084,13
#501004-621-2182#,17,#5010046212182#,5010046212182,13
#751010-414-4187#,17,#7510104144187#,7510104144187,13
#530219-492-6185#,17,#5302194926185#,5302194926185,13
#950510-1851-081#,17,#9505101851081#,9505101851081,13
#561122-1763-085#,17,#5611221763085#,5611221763085,13
621207-5110-185,15,6212075110185,6212075110185,13
960628-4133-180,15,9606284133180,9606284133180,13
651225-0376-186,15,6512250376186,6512250376186,13
870816-0468-082,15,8708160468082,8708160468082,13
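
The query above only displays the cleaned values. One possible way to persist them is an UPDATE that reuses the same nested REPLACE. This is a sketch against an in-memory table with three sample ID numbers taken from the output above; the reduced table layout is an assumption.

```python
import sqlite3

# Hypothetical, reduced version of the Students table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (AdmissionNo INTEGER PRIMARY KEY, IDNumber TEXT)")
conn.executemany(
    "INSERT INTO Students (IDNumber) VALUES (?)",
    [("#820410-5405-084#",), ("9005272774082",), ("621207-5110-185",)],
)

# Strip '-' and '#' only from rows whose ID is not already 13 characters long.
cursor = conn.execute(
    """
    UPDATE Students
    SET IDNumber = REPLACE(REPLACE(IDNumber, '-', ''), '#', '')
    WHERE LENGTH(IDNumber) <> 13
    """
)
rows_changed = cursor.rowcount

cleaned_ids = [
    row[0]
    for row in conn.execute("SELECT IDNumber FROM Students ORDER BY AdmissionNo")
]
conn.close()

print(rows_changed, cleaned_ids)
```

Running the SELECT version first, as done above, is a sensible check before overwriting the column.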


<b>3. TRIM </b>

In [23]:
%%sql
SELECT Name, LTRIM(Name) AS LeftTrimmedName, LENGTH(Name) - LENGTH(LTRIM(Name)) AS LeftSpaces
FROM Students
LIMIT 10

 * sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,LeftTrimmedName,LeftSpaces
Jan,Jan,2
Dumisani,Dumisani,0
Christopher,Christopher,5
Marco,Marco,0
marthinus,marthinus,0
Patience,Patience,0
Tony,Tony,0
gugulethu,gugulethu,0
Tumelo,Tumelo,1
Priscilla,Priscilla,0
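
For comparison, the sketch below applies LTRIM, RTRIM and TRIM to the same padded value, and also uses the optional second argument of TRIM, which lists the characters to strip instead of spaces. The padded name is a made-up value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical value with 2 leading and 3 trailing spaces.
params = {"s": "  Jan   ", "id": "#820410-5405-084#"}

trim_row = conn.execute(
    "SELECT LTRIM(:s), RTRIM(:s), TRIM(:s), TRIM(:id, '#')",
    params,
).fetchone()
conn.close()

print(trim_row)
```

TRIM with '#' removes the surrounding hash characters but leaves the hyphens inside the string, so REPLACE is still needed for those.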


<b>4. SUBSTR</b>

In [24]:
%%sql
SELECT
    Name,
    IDNumber,
    SUBSTR(IDNumber,1,2) AS Year,
    SUBSTR(IDNumber,3,2) AS Month,
    SUBSTR(IDNumber,5,2) AS Day
FROM
    Students
LIMIT 5;

 * sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,IDNumber,Year,Month,Day
Jan,#820410-5405-084#,#8,20,41
Dumisani,9005272774082,90,5,27
Christopher,9011245483180,90,11,24
Marco,9902225381086,99,2,22
marthinus,8105294344187,81,5,29
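
The first row above shows that SUBSTR on an uncleaned ID number picks up the '#' character, so the Year, Month and Day values are shifted. A possible fix is to clean the ID in a subquery and slice the cleaned value. The sketch below assumes a reduced Students table with two sample IDs; CleanID is a name introduced here for illustration.

```python
import sqlite3

# Hypothetical, reduced version of the Students table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (AdmissionNo INTEGER PRIMARY KEY, IDNumber TEXT)")
conn.executemany(
    "INSERT INTO Students (IDNumber) VALUES (?)",
    [("#820410-5405-084#",), ("9005272774082",)],
)

# Clean the ID first, then take two characters at positions 1, 3 and 5.
birth_parts = conn.execute(
    """
    SELECT
        SUBSTR(CleanID, 1, 2) AS Year,
        SUBSTR(CleanID, 3, 2) AS Month,
        SUBSTR(CleanID, 5, 2) AS Day
    FROM (
        SELECT
            AdmissionNo,
            REPLACE(REPLACE(IDNumber, '-', ''), '#', '') AS CleanID
        FROM Students
    )
    ORDER BY AdmissionNo
    """
).fetchall()
conn.close()

print(birth_parts)
```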


<b>5. INSTR</b>

In [25]:
%%sql
SELECT
    Name,
    IDNumber,
    INSTR(IDNumber,'-') AS FirstOccurrence
FROM
    Students
WHERE LENGTH(IDNumber)>13

 * sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,IDNumber,FirstOccurrence
Jan,#820410-5405-084#,8
gugulethu,#501004-621-2182#,8
Tumelo,#751010-414-4187#,8
Dirk,#530219-492-6185#,8
sello,#950510-1851-081#,8
nicole,#561122-1763-085#,8
Jacqueline,621207-5110-185,7
Louise,960628-4133-180,7
Claire,651225-0376-186,7
Ivan,870816-0468-082,7
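
INSTR returns 0 when the substring is absent, so its result can be used as a guard before slicing. The sketch below combines INSTR with SUBSTR to take everything before the first hyphen, and falls back to the whole string when there is no hyphen. The sample IDs come from the outputs above; the column aliases are illustrative.

```python
import sqlite3

# Hypothetical, reduced version of the Students table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (AdmissionNo INTEGER PRIMARY KEY, IDNumber TEXT)")
conn.executemany(
    "INSERT INTO Students (IDNumber) VALUES (?)",
    [("#820410-5405-084#",), ("621207-5110-185",), ("9005272774082",)],
)

prefix_rows = conn.execute(
    """
    SELECT
        IDNumber,
        INSTR(IDNumber, '-') AS FirstHyphen,
        CASE
            WHEN INSTR(IDNumber, '-') > 0
                THEN SUBSTR(IDNumber, 1, INSTR(IDNumber, '-') - 1)
            ELSE IDNumber  -- no hyphen found, keep the value as it is
        END AS BeforeHyphen
    FROM Students
    ORDER BY AdmissionNo
    """
).fetchall()
conn.close()

for row in prefix_rows:
    print(row)
```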


<b>6. LOWER, UPPER</b>

In [26]:
%%sql
SELECT
    Name,
    UPPER(Name) AS Uppercase,
    LOWER(Name) AS Lowercase
FROM
    Students
LIMIT 5;

 * sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,Uppercase,Lowercase
Jan,JAN,jan
Dumisani,DUMISANI,dumisani
Christopher,CHRISTOPHER,christopher
Marco,MARCO,marco
marthinus,MARTHINUS,marthinus
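
Several names in the table start with a lowercase letter. One way to standardise them is to uppercase the first character and lowercase the rest, combining UPPER, LOWER, SUBSTR and ||. The rows below are made up, and the ProperName alias is illustrative. Note that, by default, SQLite's UPPER and LOWER only convert ASCII letters, and this simple approach would also lowercase the second word of a two-part name.

```python
import sqlite3

# Hypothetical rows with inconsistent casing and padding.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (Name TEXT)")
conn.executemany(
    "INSERT INTO Students (Name) VALUES (?)",
    [("marthinus",), ("MARCO",), ("  gugulethu",)],
)

# First character uppercased, remainder lowercased, after trimming spaces.
proper_names = [
    row[0]
    for row in conn.execute(
        """
        SELECT UPPER(SUBSTR(TRIM(Name), 1, 1)) || LOWER(SUBSTR(TRIM(Name), 2)) AS ProperName
        FROM Students
        ORDER BY rowid
        """
    )
]
conn.close()

print(proper_names)
```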


<b>7. CONCATENATE: ||</b>

In [28]:
%%sql

SELECT
    Name, Surname, Name ||' '|| Surname  AS FullName
FROM
    Students
LIMIT 5;

 * sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,Surname,FullName
Jan,Makhanya,Jan Makhanya
Dumisani,Morris,Dumisani Morris
Christopher,Bennett,Christopher Bennett
Marco,barnes,Marco barnes
marthinus,Lourens,marthinus Lourens
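
One thing to watch with ||: if any operand is NULL, the whole result is NULL. The sketch below illustrates this with a hypothetical table that has a nullable Title column, and shows one possible workaround using COALESCE to substitute an empty string.

```python
import sqlite3

# Hypothetical rows: the second person has no title (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Title TEXT, Name TEXT, Surname TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [("Mr", "Tony", "Horn"), (None, "Dumisani", "Thwala")],
)

concat_rows = conn.execute(
    """
    SELECT
        Title || ' ' || Name || ' ' || Surname AS NaiveFullName,  -- NULL if Title is NULL
        COALESCE(Title || ' ', '') || Name || ' ' || Surname AS SafeFullName
    FROM Employees
    ORDER BY rowid
    """
).fetchall()
conn.close()

for row in concat_rows:
    print(row)
```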


<h3>4. Data Transformation Functions</h3>

<ul>
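    <li>DISTINCT: Return only the distinct (unique) rows of the selected columns</li>
    <li>IIF(): Two-way conditional outcome</li>
    <li>CASE: Multi-choice conditional outcome</li>
    <li>COALESCE(): Return the first non-NULL argument, commonly used to replace NULL values with a default</li>
    <li>NULLIF(): Return NULL when two values are equal, commonly used to turn a placeholder value into NULL</li>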
</ul>

In [32]:
%%sql 

sqlite:///../SoftDevEmployees.db

In [33]:
%%sql
SELECT name FROM sqlite_master WHERE type='table'

 * sqlite:///../SoftDevEmployees.db
   sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


name
Employees


In [36]:
%%sql

SELECT * FROM Employees LIMIT 5

 * sqlite:///../SoftDevEmployees.db
   sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,Surname,Title,Role,Level,Salary,Department
Dumisani,Thwala,,Back-End Developer,Graduate,52171,Web Applications
Tony,Horn,Mr,Back-End Developer,Graduate,103397,Mobile Applications
Vuyokazi,barnes,Mr,Business Analyst,Graduate,69220,Web Applications
sello,Details,Mr,Database Analyst,Graduate,54945,Mobile Applications
Jacqueline,fredericks,,Front-End Developer,Graduate,51104,Web Applications


<b>1. DISTINCT</b>

In [39]:
%%sql
SELECT DISTINCT Department FROM Employees;

 * sqlite:///../SoftDevEmployees.db
   sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Department
Web Applications
Mobile Applications


<b>2. CASE </b>

<img src="../media/case.png"/>

In [40]:
%%sql
SELECT
    Name,
    Title,
    CASE
        WHEN UPPER(Title) IN ('MS','MRS','MISS') THEN 'Female'
        WHEN UPPER(Title) IN ('MR') THEN 'Male'
        WHEN UPPER(Title) IS NULL THEN 'Value not specified'
        ELSE 'Cannot Determine from Title'
    END AS Gender
FROM
    Employees
ORDER BY Name
LIMIT 5;

 * sqlite:///../SoftDevEmployees.db
   sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,Title,Gender
André,Mrs,Female
Antoinette,Dr,Cannot Determine from Title
Bronwyn,,Value not specified
Christopher,,Value not specified
Claire,Ms,Female


<b>3. IIF</b><br/>
IIF(condition, result_if_true, result_if_false): returns the second argument when the condition is true, otherwise the third

In [41]:
%%sql

SELECT
    Name,
    Title,
    IIF(UPPER(Title) IN ('MS','MRS','MISS'),'Female',
        IIF(UPPER(Title) IN ('MR'),'Male','Cannot Determine from Title')) AS Gender
FROM
    Employees
ORDER BY Name
LIMIT 5;

 * sqlite:///../SoftDevEmployees.db
   sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,Title,Gender
André,Mrs,Female
Antoinette,Dr,Cannot Determine from Title
Bronwyn,,Cannot Determine from Title
Christopher,,Cannot Determine from Title
Claire,Ms,Female
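
In the output above, rows with a missing Title fall into the final "else" branch, unlike the CASE version, which reports them separately. One possible way to get the same behaviour with IIF is to test for NULL first. The sketch below uses made-up rows. IIF was added in SQLite 3.32.0, so the code falls back to an equivalent CASE expression on older versions.

```python
import sqlite3

# Hypothetical rows; Bronwyn has no title (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, Title TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Claire", "Ms"), ("Tony", "Mr"), ("Bronwyn", None)],
)

if sqlite3.sqlite_version_info >= (3, 32, 0):
    # Nested IIF: check for NULL before the title comparisons.
    gender_expr = """
        IIF(Title IS NULL, 'Value not specified',
            IIF(UPPER(Title) IN ('MS','MRS','MISS'), 'Female',
                IIF(UPPER(Title) = 'MR', 'Male', 'Cannot Determine from Title')))
    """
else:
    # Equivalent CASE expression for SQLite builds without IIF.
    gender_expr = """
        CASE
            WHEN Title IS NULL THEN 'Value not specified'
            WHEN UPPER(Title) IN ('MS','MRS','MISS') THEN 'Female'
            WHEN UPPER(Title) = 'MR' THEN 'Male'
            ELSE 'Cannot Determine from Title'
        END
    """

iif_rows = conn.execute(
    f"SELECT Name, {gender_expr} AS Gender FROM Employees ORDER BY rowid"
).fetchall()
conn.close()

for row in iif_rows:
    print(row)
```

With more than two or three branches, CASE is usually easier to read than nested IIF calls.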


<b>4. COALESCE</b><br/>
COALESCE(x, y, ...) returns the first argument that is not NULL


In [42]:
%%sql
SELECT
    Name,
    Surname,
    Level,
    Title,
    COALESCE(Title,'No title available') as Title_Coalesce
FROM Employees
LIMIT 5;

 * sqlite:///../SoftDevEmployees.db
   sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,Surname,Level,Title,Title_Coalesce
Dumisani,Thwala,Graduate,,No title available
Tony,Horn,Graduate,Mr,Mr
Vuyokazi,barnes,Graduate,Mr,Mr
sello,Details,Graduate,Mr,Mr
Jacqueline,fredericks,Graduate,,No title available
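
COALESCE only reacts to NULL, so an empty or blank string passes through unchanged. If a column might contain such values (an assumption here, not something the output above shows), TRIM and NULLIF can convert them to NULL first. The sketch below compares both versions on made-up titles.

```python
import sqlite3

# Hypothetical titles: NULL, a real value, an empty string and a blank string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Title TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?)",
    [(None,), ("Mr",), ("",), ("   ",)],
)

coalesce_rows = conn.execute(
    """
    SELECT
        COALESCE(Title, 'No title available') AS PlainCoalesce,
        COALESCE(NULLIF(TRIM(Title), ''), 'No title available') AS BlankAware
    FROM Employees
    ORDER BY rowid
    """
).fetchall()
conn.close()

for row in coalesce_rows:
    print(row)
```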


<b>5. NULLIF</b><br/>
NULLIF(x, y) returns NULL when x equals y, otherwise it returns x

In [43]:
%%sql
SELECT
    Name,
    Surname,
    Level,
    NULLIF(Level,'Intern') as Role
FROM Employees
WHERE Level = 'Intern'

 * sqlite:///../SoftDevEmployees.db
   sqlite:///../Students.db
   sqlite:///../chinook.db
Done.


Name,Surname,Level,Role
Jan,Ngwenya,Intern,
Patience,Willemse,Intern,
Dirk,Banda,Intern,
Janine,De Villiers,Intern,
barend,Edwards,Intern,
Jabulani,Horn,Intern,
kelly,Manuel,Intern,
Claire,Morris,Intern,
Janet,Patel,Intern,
Pearl,Stewart,Intern,
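
Because the query above filters on Level = 'Intern', every row shows NULL. Without the filter, NULLIF leaves all other values unchanged. The sketch below shows this on made-up rows, and also uses the fact that COUNT(expr) skips NULL to count the non-interns. The LevelOrNull alias is illustrative.

```python
import sqlite3

# Hypothetical rows; only the two columns needed for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, Level TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Jan", "Intern"), ("Tony", "Graduate"), ("Claire", "Intern")],
)

# 'Intern' becomes NULL, every other level is returned as it is.
nullif_rows = conn.execute(
    "SELECT Name, NULLIF(Level, 'Intern') AS LevelOrNull FROM Employees ORDER BY rowid"
).fetchall()

# COUNT(expr) ignores NULL, so this counts employees who are not interns.
total, non_interns = conn.execute(
    "SELECT COUNT(*), COUNT(NULLIF(Level, 'Intern')) FROM Employees"
).fetchone()
conn.close()

print(nullif_rows, total, non_interns)
```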
