# Analysis of E-Commerce Dataset

The dataset was downloaded from Kaggle and can be found [here](https://www.kaggle.com/datasets/utkarsharya/ecommerce-purchases).

The data dictionary provided for this dataset is as follows:
- 'Address' - customer's address.
- 'Browser Info' - information about the customer's browser.
- 'Company' - the company the customer works for.
- 'Credit Card' - the customer's credit card number.
- 'CC Exp Date' - the expiry date of the customer's credit card.
- 'CC Security Code' - the security code of the customer's credit card.
- 'CC Provider' - name of the company that provided the credit card.
- 'Email' - customer's email.
- 'Job' - customer's job title.
- 'IP Address' - customer's IP address.
- 'Language' - customer's language.
- 'Purchase Price' - price of the item purchased.



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option('display.precision', 2)  # full option key; the bare 'precision' shorthand is ambiguous in recent pandas

In [4]:
data = pd.read_csv('../data/Ecommerce Purchases.csv')
data.head()

Unnamed: 0,Address,Lot,AM or PM,Browser Info,Company,Credit Card,CC Exp Date,CC Security Code,CC Provider,Email,Job,IP Address,Language,Purchase Price
0,"16629 Pace Camp Apt. 448\nAlexisborough, NE 77...",46 in,PM,Opera/9.56.(X11; Linux x86_64; sl-SI) Presto/2...,Martinez-Herman,6011929061123406,02/20,900,JCB 16 digit,pdunlap@yahoo.com,"Scientist, product/process development",149.146.147.205,el,98.14
1,"9374 Jasmine Spurs Suite 508\nSouth John, TN 8...",28 rn,PM,Opera/8.93.(Windows 98; Win 9x 4.90; en-US) Pr...,"Fletcher, Richards and Whitaker",3337758169645356,11/18,561,Mastercard,anthony41@reed.com,Drilling engineer,15.160.41.51,fr,70.73
2,Unit 0065 Box 5052\nDPO AP 27450,94 vE,PM,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,"Simpson, Williams and Pham",675957666125,08/19,699,JCB 16 digit,amymiller@morales-harrison.com,Customer service manager,132.207.160.22,de,0.95
3,"7780 Julia Fords\nNew Stacy, WA 45798",36 vm,PM,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0 ...,"Williams, Marshall and Buchanan",6011578504430710,02/24,384,Discover,brent16@olson-robinson.info,Drilling engineer,30.250.74.19,es,78.04
4,"23012 Munoz Drive Suite 337\nNew Cynthia, TX 5...",20 IE,AM,Opera/9.58.(X11; Linux x86_64; it-IT) Presto/2...,"Brown, Watson and Andrews",6011456623207998,10/25,678,Diners Club / Carte Blanche,christopherwright@gmail.com,Fine artist,24.140.33.94,es,77.82


In [38]:
data.duplicated().sum()

0

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Address           10000 non-null  object 
 1   Lot               10000 non-null  object 
 2   AM or PM          10000 non-null  object 
 3   Browser Info      10000 non-null  object 
 4   Company           10000 non-null  object 
 5   Credit Card       10000 non-null  int64  
 6   CC Exp Date       10000 non-null  object 
 7   CC Security Code  10000 non-null  int64  
 8   CC Provider       10000 non-null  object 
 9   Email             10000 non-null  object 
 10  Job               10000 non-null  object 
 11  IP Address        10000 non-null  object 
 12  Language          10000 non-null  object 
 13  Purchase Price    10000 non-null  float64
dtypes: float64(1), int64(2), object(11)
memory usage: 1.1+ MB
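
The `info()` output lists 14 columns, while the data dictionary at the top describes only 12. As a quick sanity check, the sketch below compares the two lists by hand. The names `documented`, `actual`, and `undocumented` are illustrative only; in the notebook itself, `data.columns` could stand in for the hard-coded `actual` list.

```python
# Columns described in the data dictionary at the top of the notebook.
documented = {
    'Address', 'Browser Info', 'Company', 'Credit Card', 'CC Exp Date',
    'CC Security Code', 'CC Provider', 'Email', 'Job', 'IP Address',
    'Language', 'Purchase Price',
}

# Columns reported by data.info() above (hard-coded here so the sketch
# runs without the CSV file).
actual = [
    'Address', 'Lot', 'AM or PM', 'Browser Info', 'Company', 'Credit Card',
    'CC Exp Date', 'CC Security Code', 'CC Provider', 'Email', 'Job',
    'IP Address', 'Language', 'Purchase Price',
]

# Keep the original column order while filtering out documented names.
undocumented = [col for col in actual if col not in documented]
print(undocumented)  # ['Lot', 'AM or PM']
```

So 'Lot' and 'AM or PM' are present in the data but have no description in the data dictionary.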


## Analysis of the Address

In [48]:
data['Address']

0       16629 Pace Camp Apt. 448\nAlexisborough, NE 77...
1       9374 Jasmine Spurs Suite 508\nSouth John, TN 8...
2                        Unit 0065 Box 5052\nDPO AP 27450
3                   7780 Julia Fords\nNew Stacy, WA 45798
4       23012 Munoz Drive Suite 337\nNew Cynthia, TX 5...
                              ...                        
9995        966 Castaneda Locks\nWest Juliafurt, CO 96415
9996    832 Curtis Dam Suite 785\nNorth Edwardburgh, T...
9997                Unit 4434 Box 6343\nDPO AE 28026-0283
9998                 0096 English Rest\nRoystad, IA 12457
9999       40674 Barrett Stravenue\nGrimesville, WI 79682
Name: Address, Length: 10000, dtype: object

In [64]:
data['Address'].str.split('\n').str.get(1)

0       Alexisborough, NE 77130-7478
1          South John, TN 84355-4179
2                       DPO AP 27450
3                New Stacy, WA 45798
4              New Cynthia, TX 57826
                    ...             
9995        West Juliafurt, CO 96415
9996     North Edwardburgh, TX 55158
9997               DPO AE 28026-0283
9998               Roystad, IA 12457
9999           Grimesville, WI 79682
Name: Address, Length: 10000, dtype: object

It can be seen that some of the addresses have the DPO abbreviation in them. Let's explore that.

In [63]:
data['Address'][data['Address'].str.split('\n').str.get(1).str.contains('DPO')].\
        str.split('\n').str.get(1).str.split(' ').str.get(1).value_counts()

AP    129
AA    127
AE    124
Name: Address, dtype: int64

After some online research, it turns out that DPO stands for Diplomatic Post Office, AA stands for Armed Forces Americas, AE stands for Armed Forces Europe, and AP stands for Armed Forces Pacific.
[Source](https://knowledgecenter.zuora.com/BB_Introducing_Z_Business/D_Country%2C_State%2C_and_Province_Codes/B_State_Names_and_2-Digit_Codes)
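
To make these codes easier to read in later analysis, they could be mapped to their full names. The following is a minimal sketch on a few sample addresses copied from the outputs above. The `sample` frame, the `MILITARY_REGIONS` lookup, and the 'Military Region' column are hypothetical names introduced for illustration, not part of the dataset.

```python
import pandas as pd

# A few addresses in the same format as the 'Address' column shown above.
sample = pd.DataFrame({
    'Address': [
        '16629 Pace Camp Apt. 448\nAlexisborough, NE 77130-7478',
        'Unit 0065 Box 5052\nDPO AP 27450',
        'Unit 4434 Box 6343\nDPO AE 28026-0283',
        '7780 Julia Fords\nNew Stacy, WA 45798',
    ]
})

# Hypothetical lookup table for the three military region codes.
MILITARY_REGIONS = {
    'AA': 'Armed Forces Americas',
    'AE': 'Armed Forces Europe',
    'AP': 'Armed Forces Pacific',
}

# Second line of each address, e.g. 'DPO AP 27450'.
second_line = sample['Address'].str.split('\n').str.get(1)

# Capture the two-letter code that follows 'DPO'; non-DPO rows become NaN.
military_code = second_line.str.extract(r'^DPO\s+([A-Z]{2})\b', expand=False)

# Translate the codes to readable names; non-DPO rows stay NaN.
sample['Military Region'] = military_code.map(MILITARY_REGIONS)
print(sample['Military Region'])
```

Only rows whose second address line starts with 'DPO' receive a region name, so ordinary state addresses are left untouched.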

In [66]:
data['Location'] = data['Address'].str.split('\n').str.get(1)

In [72]:
data['Address'].str.split('\n').str.get(1).str.split(', ').str.get(1).str.split(' ').str.get(0)

0        NE
1        TN
2       NaN
3        WA
4        TX
       ... 
9995     CO
9996     TX
9997    NaN
9998     IA
9999     WI
Name: Address, Length: 10000, dtype: object
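
The NaN values above come from the DPO rows: their second line has no ', ' separator, so the split on ', ' leaves nothing at position 1. One possible alternative, sketched here on a small sample taken from the outputs above, is a single regular expression that captures the two-letter code directly before the ZIP code. It assumes every address ends with a 5-digit ZIP and an optional 4-digit suffix, as in the rows shown; the `second_line` and `state` names are illustrative.

```python
import pandas as pd

# Second address lines copied from the outputs above.
second_line = pd.Series([
    'Alexisborough, NE 77130-7478',
    'DPO AP 27450',
    'New Stacy, WA 45798',
    'DPO AE 28026-0283',
])

# Two capital letters followed by a ZIP code (5 digits, optional -4 suffix)
# at the end of the string. Rows that do not match would become NaN.
state = second_line.str.extract(r'\b([A-Z]{2}) \d{5}(?:-\d{4})?$', expand=False)
print(state.tolist())  # ['NE', 'AP', 'WA', 'AE']
```

With this pattern the DPO rows yield their region code (AP, AE) instead of NaN, so they can be kept or filtered explicitly later.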