# Part III: Explore

![alt text](data/inu_neko_logo_small.png "Inu + Neko")

Hello! We are Inu + Neko, a dog and cat services and supplies store located in New York City. We just started our e-commerce business and need your help analyzing our data!

## Description

Before starting the analysis, we need to make sure the data is clean. As a reminder, we should check for:

- Duplicate records
- Consistent formatting
- Missing values
- Obviously wrong values
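
As a rough illustration of the four checks listed above, here is a minimal sketch on a small made-up table. The column names follow the dataset, but the rows and the `toy` variable are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy order lines; values are invented for illustration.
toy = pd.DataFrame({
    "trans_id": [1, 1, 2, 2, 3],
    "prod_title": ["Cat Cave", "Cat Cave", "Reddy Beddy", "reddy beddy ", "Tuna Tasties"],
    "trans_quantity": [1, 1, 2, 1, -3],
    "prod_price": [72.99, 72.99, 60.99, 60.99, None],
})

# Duplicate records: rows identical to an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())

# Missing values: number of nulls per column.
n_missing = toy.isna().sum()

# Consistent formatting: if trimming and lower-casing merges titles,
# some titles differ only by whitespace or case.
n_raw_titles = toy["prod_title"].nunique()
n_norm_titles = toy["prod_title"].str.strip().str.lower().nunique()

# Obviously wrong values: a quantity should be positive.
n_bad_quantity = int((toy["trans_quantity"] <= 0).sum())

print(n_duplicates, int(n_missing["prod_price"]), n_raw_titles - n_norm_titles, n_bad_quantity)
```

On the real data each of these counts should come out as zero if the cleaning step did its job.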

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df_cleaned = pd.read_csv('data/inu_neko_orderline_clean.csv')
df_cleaned

,trans_id,prod_upc,cust_id,trans_timestamp,trans_year,trans_month,trans_day,trans_hour,trans_quantity,cust_age,cust_state,prod_price,prod_title,prod_category,prod_animal_type,total_sales
0,10300097,719638485153,1001019,2021-01-01 07:35:21.439873,2021,1,1,1,1,20,New York,72.99,Cat Cave,bedding,cat,72.99
1,10300093,73201504044,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,1,34,New York,18.95,Purrfect Puree,treat,cat,18.95
2,10300093,719638485153,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,1,34,New York,72.99,Cat Cave,bedding,cat,72.99
3,10300093,441530839394,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,2,34,New York,28.45,Ball and String,toy,cat,56.90
4,10300093,733426809698,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,1,34,New York,18.95,Yum Fish-Dish,food,cat,18.95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38218,10327860,287663658863,1022098,2021-06-30 15:37:12.821020,2021,6,30,30,1,25,New York,9.95,All Veggie Yummies,treat,dog,9.95
38219,10327960,140160459467,1022157,2021-06-30 15:45:09.872732,2021,6,30,30,2,31,Pennsylvania,48.95,Snoozer Essentails,bedding,dog,97.90
38220,10328009,425361189561,1022189,2021-06-30 15:57:44.295104,2021,6,30,30,2,53,New Jersey,15.99,Snack-em Fish,treat,cat,31.98
38221,10328089,733426809698,1022236,2021-06-30 15:59:29.801593,2021,6,30,30,1,23,Tennessee,18.95,Yum Fish-Dish,food,cat,18.95


In [3]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38223 entries, 0 to 38222
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   trans_id          38223 non-null  int64  
 1   prod_upc          38223 non-null  int64  
 2   cust_id           38223 non-null  int64  
 3   trans_timestamp   38223 non-null  object 
 4   trans_year        38223 non-null  int64  
 5   trans_month       38223 non-null  int64  
 6   trans_day         38223 non-null  int64  
 7   trans_hour        38223 non-null  int64  
 8   trans_quantity    38223 non-null  int64  
 9   cust_age          38223 non-null  int64  
 10  cust_state        38223 non-null  object 
 11  prod_price        38223 non-null  float64
 12  prod_title        38223 non-null  object 
 13  prod_category     38223 non-null  object 
 14  prod_animal_type  38223 non-null  object 
 15  total_sales       38223 non-null  float64
dtypes: float64(2), int64(9), object(5)
memor

#### Question 1: Number of Orders

How many transactions are there?

In [4]:
# your code here

num_trans = len(df_cleaned)
num_trans

38223

In [5]:
# Q1 Test Cases
assert 39000 > num_trans > 25000
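
Note that `len(df_cleaned)` counts rows, and the preview shows transaction 10300093 spread over several rows. If a "transaction" is taken to mean a distinct `trans_id` (an assumption about what the question intends), the count could be sketched as follows. The `toy` frame reuses IDs and titles from the preview rows.

```python
import pandas as pd

# A few order lines modeled on the preview rows above.
toy = pd.DataFrame({
    "trans_id": [10300097, 10300093, 10300093, 10300093],
    "prod_title": ["Cat Cave", "Purrfect Puree", "Cat Cave", "Ball and String"],
})

n_order_lines = len(toy)                   # one row per product in a basket
n_transactions = toy["trans_id"].nunique() # one count per distinct basket

print(n_order_lines, n_transactions)
```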

#### Question 2: Alpha and Omega I
What was the month and day of the first sale? Store as a tuple in that order and assign the tuple to the variable `first_date`.

In [6]:
# your code here
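# Assumes the rows are sorted by trans_timestamp, as the preview suggests,
# so row 0 holds the earliest sale.
mth = df_cleaned.loc[0, "trans_month"]
day = df_cleaned.loc[0, "trans_day"]
first_date = (mth, day)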
first_date

(1, 1)

In [7]:
len(first_date)

2

In [8]:
str(first_date[0]).isnumeric() 

True

In [9]:
str(first_date[1]).isnumeric()

True

In [10]:
# Q2 Test Cases
assert len(first_date) == 2
assert str(first_date[0]).isnumeric() 
assert str(first_date[1]).isnumeric() 

#### Question 3: Alpha and Omega II
What was the month and day of the last sale? Store as a tuple in that order and assign the tuple to the variable `last_date`.

In [11]:
# your code here
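# Assumes the rows are sorted by trans_timestamp, so the last row is the latest sale.
# iloc[-1] avoids hard-coding the final index label.
mth = df_cleaned["trans_month"].iloc[-1]
day = df_cleaned["trans_day"].iloc[-1]
last_date = (mth, day)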
last_date

(6, 30)

In [12]:
len(last_date)

2

In [13]:
str(last_date[0]).isnumeric() 

True

In [14]:
str(last_date[1]).isnumeric()

True

In [15]:
# Q3 Test Cases
assert len(last_date) == 2
assert str(last_date[0]).isnumeric() 
assert str(last_date[1]).isnumeric() 
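
The two answers above read the first and last rows, which only works if the file is in chronological order. A sketch that does not depend on row order parses `trans_timestamp` and takes the minimum and maximum. The `toy` frame and the names `first_md` and `last_md` are hypothetical; the timestamps are copied from the previews and deliberately shuffled.

```python
import pandas as pd

# Timestamps taken from the previews above, in non-chronological order.
toy = pd.DataFrame({
    "trans_timestamp": [
        "2021-06-30 15:59:29.801593",
        "2021-01-01 07:35:21.439873",
        "2021-01-31 13:38:49.942487",
    ]
})

# Parse the strings into datetimes so min/max compare by time, not by text.
ts = pd.to_datetime(toy["trans_timestamp"])
first_ts, last_ts = ts.min(), ts.max()

first_md = (first_ts.month, first_ts.day)
last_md = (last_ts.month, last_ts.day)

print(first_md, last_md)
```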

#### Question 4: Cats vs Dogs

Which animal product type is most popular?

In [16]:
# your code here
# Count order lines per animal type and take the label with the highest count.
most_pop = df_cleaned["prod_animal_type"].value_counts().idxmax()

In [17]:
most_pop

'cat'

In [18]:
type(most_pop)

str

In [19]:
most_pop in ['cat', 'dog']

True

In [20]:
# Q4 Test Cases
assert type(most_pop) == str
assert most_pop in ['cat', 'dog']
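
"Most popular" is measured above by the number of order lines. Another reasonable reading is units sold, and the two can disagree. The sketch below uses a made-up `toy` frame to show the difference; which definition the question intends is an assumption either way.

```python
import pandas as pd

# Hypothetical order lines: three single-item cat lines, one five-item dog line.
toy = pd.DataFrame({
    "prod_animal_type": ["cat", "cat", "cat", "dog"],
    "trans_quantity": [1, 1, 1, 5],
})

# Popularity by number of order lines.
by_lines = toy["prod_animal_type"].value_counts().idxmax()

# Popularity by total units sold.
by_units = toy.groupby("prod_animal_type")["trans_quantity"].sum().idxmax()

print(by_lines, by_units)
```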

#### Question 5: More Money More Problems I

What was the total dollar amount made in the month of January? Store this in the variable `jan_rev`.

In [21]:
# your code here
df_cleaned["trans_month"].value_counts()

6    13256
5     9858
4     6884
3     4645
2     2374
1     1206
Name: trans_month, dtype: int64

In [22]:
sales_jan = df_cleaned[df_cleaned["trans_month"] == 1]

In [23]:
sales_jan

,trans_id,prod_upc,cust_id,trans_timestamp,trans_year,trans_month,trans_day,trans_hour,trans_quantity,cust_age,cust_state,prod_price,prod_title,prod_category,prod_animal_type,total_sales
0,10300097,719638485153,1001019,2021-01-01 07:35:21.439873,2021,1,1,1,1,20,New York,72.99,Cat Cave,bedding,cat,72.99
1,10300093,73201504044,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,1,34,New York,18.95,Purrfect Puree,treat,cat,18.95
2,10300093,719638485153,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,1,34,New York,72.99,Cat Cave,bedding,cat,72.99
3,10300093,441530839394,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,2,34,New York,28.45,Ball and String,toy,cat,56.90
4,10300093,733426809698,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,1,34,New York,18.95,Yum Fish-Dish,food,cat,18.95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1201,10300964,100469015054,1001812,2021-01-31 13:38:49.942487,2021,1,31,31,1,24,Tennessee,18.95,Tuna Tasties,treat,cat,18.95
1202,10300948,287663658863,1001798,2021-01-31 13:52:06.681110,2021,1,31,31,1,29,Pennsylvania,9.95,All Veggie Yummies,treat,dog,9.95
1203,10300961,717036112695,1001809,2021-01-31 14:12:53.346135,2021,1,31,31,1,38,Pennsylvania,60.99,Reddy Beddy,bedding,dog,60.99
1204,10300961,575410882303,1001809,2021-01-31 14:12:53.346135,2021,1,31,31,1,38,Pennsylvania,21.95,Chomp-a Plush,toy,dog,21.95


In [24]:
jan_rev = sales_jan["total_sales"].sum()

In [25]:
jan_rev

51739.740000000005

In [26]:
# Q5 Test Cases
assert  40000 <= jan_rev <= 60000

#### Question 6: More Money More Problems II

What was the total dollar amount made in the month of June? Store this in the variable `june_rev`.

In [27]:
# your code here
sales_june = df_cleaned[df_cleaned["trans_month"] == 6]

In [28]:
sales_june

,trans_id,prod_upc,cust_id,trans_timestamp,trans_year,trans_month,trans_day,trans_hour,trans_quantity,cust_age,cust_state,prod_price,prod_title,prod_category,prod_animal_type,total_sales
24967,10318591,733426809698,1015889,2021-06-01 05:32:36.564058,2021,6,1,1,1,25,Virginia,18.95,Yum Fish-Dish,food,cat,18.95
24968,10318416,719638485153,1012671,2021-06-01 06:39:02.110506,2021,6,1,1,1,36,New York,72.99,Cat Cave,bedding,cat,72.99
24969,10318416,287663658863,1012671,2021-06-01 06:39:02.110506,2021,6,1,1,1,36,New York,9.95,All Veggie Yummies,treat,dog,9.95
24970,10318568,344538897332,1015876,2021-06-01 06:40:46.147412,2021,6,1,1,1,31,Texas,19.99,Feline Fix Mix,treat,cat,19.99
24971,10318568,832878954342,1015876,2021-06-01 06:40:46.147412,2021,6,1,1,1,31,Texas,45.99,Snoozer Hammock,bedding,cat,45.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38218,10327860,287663658863,1022098,2021-06-30 15:37:12.821020,2021,6,30,30,1,25,New York,9.95,All Veggie Yummies,treat,dog,9.95
38219,10327960,140160459467,1022157,2021-06-30 15:45:09.872732,2021,6,30,30,2,31,Pennsylvania,48.95,Snoozer Essentails,bedding,dog,97.90
38220,10328009,425361189561,1022189,2021-06-30 15:57:44.295104,2021,6,30,30,2,53,New Jersey,15.99,Snack-em Fish,treat,cat,31.98
38221,10328089,733426809698,1022236,2021-06-30 15:59:29.801593,2021,6,30,30,1,23,Tennessee,18.95,Yum Fish-Dish,food,cat,18.95


In [29]:
june_rev = sales_june["total_sales"].sum()

In [30]:
june_rev

548822.73

In [31]:
# Q6 Test Cases
assert  500000 <= june_rev <= 600000
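
Questions 5 and 6 filter one month at a time. A single `groupby` would give the revenue for every month at once, from which January and June can be looked up. This is a sketch on a made-up `toy` frame, not the notebook's actual figures.

```python
import pandas as pd

# Hypothetical order lines for two months.
toy = pd.DataFrame({
    "trans_month": [1, 1, 6, 6, 6],
    "total_sales": [72.99, 18.95, 18.95, 72.99, 9.95],
})

# Revenue per month, indexed by month number.
monthly_rev = toy.groupby("trans_month")["total_sales"].sum()

jan_rev_toy = monthly_rev.loc[1]
june_rev_toy = monthly_rev.loc[6]

print(round(jan_rev_toy, 2), round(june_rev_toy, 2))
```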

#### Question 7: Transaction Size

What is the average number of items bought in each transaction? Store this in the variable `avg_num_items`.

In [32]:
# your code here

# Total items sold divided by the number of distinct transactions.
avg_num_items = df_cleaned['trans_quantity'].sum() / df_cleaned['trans_id'].nunique()

In [33]:
avg_num_items

1.8745628434801227

In [34]:
# Q7 Test Cases
assert  0 <= avg_num_items <= 3
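
The same average can be reached by first summing the quantity within each transaction and then averaging those basket sizes, which also exposes the per-transaction distribution. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical order lines: transaction 1 has two lines, transaction 2 has one.
toy = pd.DataFrame({
    "trans_id": [1, 1, 2],
    "trans_quantity": [1, 2, 1],
})

# Items per transaction, then the mean basket size.
items_per_trans = toy.groupby("trans_id")["trans_quantity"].sum()
avg_by_groupby = items_per_trans.mean()

# Equivalent ratio form: total items over distinct transactions.
avg_by_ratio = toy["trans_quantity"].sum() / toy["trans_id"].nunique()

print(avg_by_groupby, avg_by_ratio)
```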

#### Question 8: Best Products I

What are the top ten product titles by the number of sales? Display in descending order. Store in variable `top_num_sales`.

In [35]:
# your code here

df_cleaned.groupby(["prod_title"])["trans_quantity"].sum().sort_values(ascending=False).head(10)

prod_title
Reddy Beddy               6583
Yum Fish-Dish             4298
Kitty Climber             3329
Feline Fix Mix            3262
Tuna Tasties              3102
Chewie Dental             2579
Purrfect Puree            2453
Whole Chemistry Recipe    2410
Cat Cave                  2408
Snoozer Hammock           2311
Name: trans_quantity, dtype: int64

In [36]:
df_cleaned['prod_title'].value_counts().head(10)

Reddy Beddy               4734
Yum Fish-Dish             3086
Kitty Climber             2438
Feline Fix Mix            2412
Tuna Tasties              2276
Chewie Dental             1911
Purrfect Puree            1795
Cat Cave                  1762
Whole Chemistry Recipe    1747
All Veggie Yummies        1702
Name: prod_title, dtype: int64

In [37]:
top_num_sales = df_cleaned['prod_title'].value_counts().head(10)
top_num_sales

Reddy Beddy               4734
Yum Fish-Dish             3086
Kitty Climber             2438
Feline Fix Mix            2412
Tuna Tasties              2276
Chewie Dental             1911
Purrfect Puree            1795
Cat Cave                  1762
Whole Chemistry Recipe    1747
All Veggie Yummies        1702
Name: prod_title, dtype: int64

In [38]:
len(top_num_sales)

10

In [39]:
# Q8 Test Cases
assert len(top_num_sales) == 10

#### Question 9: Best Products II

What are the top ten product titles by total dollar amount made? Display in descending order. Store in variable `top_tot_sales`.

In [40]:
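# Total revenue per product title. A separate name is used so the
# Question 8 answer in top_num_sales is not overwritten.
sales_by_title = df_cleaned.groupby('prod_title')['total_sales'].sum().reset_index()
top_tot_sales = sales_by_title.sort_values("total_sales", ascending=False).head(10)
top_tot_sales = top_tot_sales["prod_title"]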
top_tot_sales

11           Reddy Beddy
2               Cat Cave
8          Kitty Climber
15       Snoozer Hammock
14    Snoozer Essentails
20         Yum Fish-Dish
12         Scratchy Post
5         Feline Fix Mix
7            Foozy Mouse
18          Tuna Tasties
Name: prod_title, dtype: object

In [41]:
# your code here

df_cleaned.groupby(["prod_title"])["total_sales"].sum().sort_values(ascending=False).head(10)

prod_title
Reddy Beddy           408023.09
Cat Cave              175759.92
Kitty Climber         119810.71
Snoozer Hammock       106282.89
Snoozer Essentails    100739.10
Yum Fish-Dish          81447.10
Scratchy Post          65951.34
Feline Fix Mix         65207.38
Foozy Mouse            61460.37
Tuna Tasties           58782.90
Name: total_sales, dtype: float64

In [42]:
top_tot_sales = df_cleaned.groupby(["prod_title"])["total_sales"].sum().sort_values(ascending=False).head(10)
top_tot_sales

prod_title
Reddy Beddy           408023.09
Cat Cave              175759.92
Kitty Climber         119810.71
Snoozer Hammock       106282.89
Snoozer Essentails    100739.10
Yum Fish-Dish          81447.10
Scratchy Post          65951.34
Feline Fix Mix         65207.38
Foozy Mouse            61460.37
Tuna Tasties           58782.90
Name: total_sales, dtype: float64

In [43]:
len(top_tot_sales)

10

In [44]:
# Q9 Test Cases
assert len(top_tot_sales) == 10
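
The revenue ranking depends on `total_sales` being right. In the preview rows it equals `prod_price` times `trans_quantity` (for example 28.45 × 2 = 56.90), so a quick consistency check is possible. This is a sketch under that assumption, using four rows copied from the previews.

```python
import numpy as np
import pandas as pd

# Price, quantity and total copied from rows shown in the previews above.
toy = pd.DataFrame({
    "prod_price": [72.99, 28.45, 48.95, 15.99],
    "trans_quantity": [1, 2, 2, 2],
    "total_sales": [72.99, 56.90, 97.90, 31.98],
})

# Compare with a tolerance, since these are floating-point dollar amounts.
matches = np.isclose(toy["prod_price"] * toy["trans_quantity"], toy["total_sales"])
n_mismatch = int((~matches).sum())

print(n_mismatch)
```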

#### Question 10: Bonus

What is the proportion of returning customers? Store as variable `prop_returning`.

In [45]:
# your code here

df_cleaned.groupby("cust_id")["trans_id"].count()

cust_id
1001012    3
1001013    1
1001014    3
1001015    4
1001016    2
          ..
1022248    1
1022249    1
1022250    1
1022251    1
1022252    2
Name: trans_id, Length: 21241, dtype: int64

In [46]:
df2 = pd.DataFrame(df_cleaned.groupby("cust_id")["trans_id"].count())
df2

cust_id,trans_id
1001012,3
1001013,1
1001014,3
1001015,4
1001016,2
...,...
1022248,1
1022249,1
1022250,1
1022251,1


In [47]:
returncustomers = (df2["trans_id"] > 1).sum()
returncustomers

9606

In [48]:
onetimecustomers = (df2["trans_id"] == 1).sum()
onetimecustomers

11635

In [49]:
len(df2)

21241

In [50]:
prop_returning = returncustomers / len(df2)

In [51]:
0 <= prop_returning <= 1

True

In [52]:
# Q10 Test Cases
assert 0 <= prop_returning <= 1
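
One caveat on the approach above: `count()` tallies order lines, so a customer who bought several products in a single transaction (such as customer 1001015 in the preview) is treated as returning. If "returning" is taken to mean more than one distinct transaction, which is an assumption about the intended definition, `nunique()` is the closer fit. In the sketch below, customer 9999999 and its transaction IDs are invented.

```python
import pandas as pd

# Customer 1001015: four lines in one transaction (from the preview).
# Customer 1001019: one line. Customer 9999999: hypothetical, two transactions.
toy = pd.DataFrame({
    "cust_id": [1001015, 1001015, 1001015, 1001015, 1001019, 9999999, 9999999],
    "trans_id": [10300093, 10300093, 10300093, 10300093, 10300097, 1, 2],
})

# Order lines per customer versus distinct transactions per customer.
lines_per_cust = toy.groupby("cust_id")["trans_id"].count()
trans_per_cust = toy.groupby("cust_id")["trans_id"].nunique()

# The mean of a boolean Series is the share of True values.
prop_by_lines = (lines_per_cust > 1).mean()
prop_by_trans = (trans_per_cust > 1).mean()

print(round(prop_by_lines, 3), round(prop_by_trans, 3))
```

Here two of three customers have more than one order line, but only one of three made a second transaction.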