# Anonymize the dataset

The dataset contains sensitive customer information. We will replace each customer's first and last name with a SHA-256 hash that serves as a customer identifier, and keep only the columns that are most useful for the analysis.

In [63]:
import pandas as pd

orders = pd.read_csv('orders.csv')

A first look at the orders data:

In [64]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 35 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Order ID               264 non-null    int64  
 1   Order Date             264 non-null    object 
 2   First Name             262 non-null    object 
 3   Last Name              262 non-null    object 
 4   Email                  262 non-null    object 
 5   Company                24 non-null     object 
 6   Phone                  262 non-null    object 
 7   Street                 262 non-null    object 
 8   City                   262 non-null    object 
 9   State/Province         260 non-null    object 
 10  Postal/Zip Code        262 non-null    object 
 11  Country                262 non-null    object 
 12  Shipping Instructions  16 non-null     object 
 13  Shipping Method        262 non-null    object 
 14  Shipping Total         262 non-null    float64
 15  Discou

Next, we import the hashlib library, hash each customer's full name, and count how often each resulting ID appears.

In [65]:
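import hashlib

# Combine first and last name into one string per customer.
# Rows with a missing name end up as the string 'nan' after astype(str),
# so they all receive the same Customer ID.
orders['Customer Name'] = orders['First Name'] + ' ' + orders['Last Name']
orders['Customer Name'] = orders['Customer Name'].astype(str)

# Hash each name with SHA-256 and keep the hexadecimal digest as the customer ID
orders['Customer ID'] = orders['Customer Name'].apply(
    lambda x: hashlib.sha256(x.encode()).hexdigest()
)

orders['Customer ID'].value_counts()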


757f33ad9fbcf1f55a1f6420161ed88772d990bf14a1a18c152d6ca4f3fc13e4    4
604d763a54bbb2affa2a56f7e893a692551ff4268ac09644af11800b168b3c12    3
ee97429a3a37102bf2a3fb0d3d11a2585b6dfa2f201c472f42705687f3dbac7f    3
2b399d7420dc7bc7af563099b028273c83adae8cf92fef0d4fc1b992765d109d    2
29b2bac5decdc8aa13c07ad0c2aee6076099044cc28bb450650575095b0fb248    2
                                                                   ..
76139883ebc5fc9f384d2de559c68af551cede6c2fb2152d9872fd3532908498    1
65518f673f0f0ee95524bbfac829fac90915a87b89fe923155080b81621f5fc9    1
f3c84917b5a85ccec7491d0b233c6381303a8a8350b03bd8f7867996710c84f3    1
bb896dda33736b77df1bdd93fc3d85bae54c6b2a75c12b04db7297c8fdba1b10    1
6c9509162976dc640ff232283affc522f04b0425ad2171a510115f67bd0321a3    1
Name: Customer ID, Length: 236, dtype: int64
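
A plain SHA-256 of a name can be matched by anyone who hashes a list of candidate names, so it works better as a pseudonym than as true anonymization. The sketch below shows one possible hardening, assuming a keyed hash (HMAC-SHA256) is acceptable for this analysis. The `pseudonymize` helper and `SECRET_KEY` are hypothetical names introduced for illustration and are not part of the original notebook. The helper also normalizes case and whitespace, and returns `None` for missing names instead of hashing the string 'nan'.

```python
import hashlib
import hmac
import math

# Hypothetical secret key: in practice, load it from a secure location
# and never store it alongside the anonymized file.
SECRET_KEY = b'replace-with-a-long-random-secret'


def pseudonymize(name, key=SECRET_KEY):
    """Return a keyed SHA-256 digest of a customer name, or None if it is missing."""
    # Treat None and NaN as missing instead of hashing the string 'nan'
    if name is None or (isinstance(name, float) and math.isnan(name)):
        return None
    # Normalize case and whitespace so spelling variants map to the same ID
    normalized = ' '.join(str(name).lower().split())
    return hmac.new(key, normalized.encode('utf-8'), hashlib.sha256).hexdigest()


print(pseudonymize('Ada Lovelace') == pseudonymize('  ada   LOVELACE '))  # True
print(pseudonymize(float('nan')))  # None
```

To try it on the orders data, the helper could be applied to the combined name column before the `astype(str)` step, for example `orders['Customer Name'].apply(pseudonymize)`. The resulting IDs would differ from the ones shown above.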

In [66]:
orders = orders.rename(columns={'Shipped': 'Shipping Date'})
orders = orders[[
    'Order ID', 'Status', 'Order Date', 'Shipping Date', 'Refund Date', 'Customer ID',
    'City', 'State/Province', 'Postal/Zip Code', 'Country', 'Product Title', 'Option Summary',
    'Quantity', 'Unit Price', 'Discount Price', 'Item Total', 'Shipping Total', 'Discount Total',
    'Order Total',
]]
orders.head()

| | Order ID | Status | Order Date | Shipping Date | Refund Date | Customer ID | City | State/Province | Postal/Zip Code | Country | Product Title | Option Summary | Quantity | Unit Price | Discount Price | Item Total | Shipping Total | Discount Total | Order Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 134029 | billed | 2018-12-07 | 2018-12-12 | NaN | 79b517750071a0fce0ea0c2ef27fc40d5063df78aac79c... | Brookings | SD | 57006 | United States of America | DAFM synth - GENESIS YM2612 / YM3438 | FM YAMAHA chip: YM3438 - Fully Assembled | 1 | 124.38 | 124.38 | 124.38 | 0.0 | 0.0 | 124.38 |
| 1 | 136661 | billed | 2019-01-04 | 2019-01-13 | NaN | 5f54c081a80b3cd0960794be1ea8f4fbd1bb977f7d9b30... | Neustadt | RP | 67433 | Germany | DAFM synth - GENESIS YM2612 / YM3438 | FM YAMAHA chip: YM2612 - Fully Assembled | 1 | 126.12 | 126.12 | 126.12 | 0.0 | 0.0 | 126.12 |
| 2 | 136829 | billed | 2019-01-05 | 2019-01-16 | NaN | 1eaa433ace9b356d976ae83bdfac56282ef0fa7afcc05f... | Auckland | Auckland | 1024 | New Zealand | DAFM synth - GENESIS YM2612 / YM3438 | FM YAMAHA chip: YM2612 - Fully Assembled | 1 | 126.12 | 126.12 | 126.12 | 0.0 | 0.0 | 126.12 |
| 3 | 137381 | billed | 2019-01-10 | 2019-01-19 | NaN | 92ce259747bc2c850787c5e27547416aa4da0a13a23708... | Berlin | Berlin | 10409 | Germany | DAFM synth - GENESIS YM2612 / YM3438 | FM YAMAHA chip: YM2612 - Fully Assembled | 1 | 126.12 | 126.12 | 126.12 | 0.0 | 0.0 | 126.12 |
| 4 | 142040 | billed | 2019-02-23 | 2019-02-25 | NaN | d1b71ad194e919d69cabdf143cea070293c4190d6c3be3... | Bluff City | TN | 37618 | United States of America | DAFM synth - GENESIS YM2612 / YM3438 | FM YAMAHA chip: YM2612 - Fully Assembled | 1 | 156.11 | 129.99 | 129.99 | 19.99 | 26.12 | 149.98 |
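
After selecting columns, it can be worth confirming that no directly identifying column survived. The sketch below is one way to do that. `leaked_columns` and `IDENTIFYING_COLUMNS` are hypothetical names, and the list of identifying columns is an assumption based on the first `orders.info()` listing plus the temporary 'Customer Name' column.

```python
# Assumed list of directly identifying columns; names follow the first
# orders.info() listing plus the temporary 'Customer Name' column.
IDENTIFYING_COLUMNS = [
    'First Name', 'Last Name', 'Customer Name',
    'Email', 'Company', 'Phone', 'Street',
]


def leaked_columns(columns, identifying=IDENTIFYING_COLUMNS):
    """Return the identifying column names that are still present."""
    present = set(columns)
    return [name for name in identifying if name in present]


before = ['Order ID', 'First Name', 'Last Name', 'Email', 'City']
after = ['Order ID', 'Customer ID', 'City']

print(leaked_columns(before))  # ['First Name', 'Last Name', 'Email']
print(leaked_columns(after))   # []
```

On the real data, the check would be `leaked_columns(orders.columns)`, which should return an empty list after the column selection. Location columns such as City and Postal/Zip Code are kept here, so whether the result is anonymous enough depends on how the file will be shared.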


In [67]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order ID         264 non-null    int64  
 1   Status           264 non-null    object 
 2   Order Date       264 non-null    object 
 3   Shipping Date    253 non-null    object 
 4   Refund Date      13 non-null     object 
 5   Customer ID      264 non-null    object 
 6   City             262 non-null    object 
 7   State/Province   260 non-null    object 
 8   Postal/Zip Code  262 non-null    object 
 9   Country          262 non-null    object 
 10  Product Title    264 non-null    object 
 11  Option Summary   264 non-null    object 
 12  Quantity         264 non-null    int64  
 13  Unit Price       264 non-null    float64
 14  Discount Price   264 non-null    float64
 15  Item Total       264 non-null    float64
 16  Shipping Total   262 non-null    float64
 17  Discount Total  

Finally, we save the result as a CSV file named anonym_orders.csv.

In [68]:
orders.to_csv('anonym_orders.csv', index = False)
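
As a quick sanity check, the saved file can be read back to confirm that it round-trips cleanly. The sketch below uses a small made-up DataFrame and an in-memory buffer instead of the real file, so the values are illustrative only. It shows that `index=False` avoids an extra `Unnamed: 0` column on reload, and that reading postal codes as strings preserves leading zeros.

```python
import io

import pandas as pd

# Toy data standing in for the anonymized orders (values are made up)
toy = pd.DataFrame({
    'Order ID': [1, 2],
    'Customer ID': ['ab12', 'cd34'],
    'Postal/Zip Code': ['01234', '57006'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Read postal codes as strings so leading zeros are not lost
reloaded = pd.read_csv(buffer, dtype={'Postal/Zip Code': str})

print(reloaded.columns.tolist())             # ['Order ID', 'Customer ID', 'Postal/Zip Code']
print(reloaded['Postal/Zip Code'].tolist())  # ['01234', '57006']
```

For the real file, the same check would be `pd.read_csv('anonym_orders.csv', dtype={'Postal/Zip Code': str})`.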