# Part III: Explore

![alt text](data/inu_neko_logo_small.png "Inu + Neko")

Hello! We are Inu + Neko, a dog and cat services and supplies store located in New York City. We just started our e-commerce business and need your help analyzing our data!

## Description

We need to make sure the data is clean before starting your analysis. As a reminder, we should check for:

- Duplicate records
- Consistent formatting
- Missing values
- Obviously wrong values
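
As a rough illustration of the four checks above, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its values are hypothetical and are not taken from the store's data; only the column names mirror the real file.

```python
import pandas as pd

# Toy order lines (hypothetical values, not the store's data).
toy = pd.DataFrame({
    "trans_id": [1, 1, 2, 3],
    "cust_state": ["New York", "New York", "new york", None],
    "prod_price": [12.99, 12.99, -5.00, 18.95],
})

# Duplicate records: rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Missing values: number of nulls in each column.
n_missing = toy.isna().sum()

# Consistent formatting: distinct spellings of what should be one state.
state_spellings = toy["cust_state"].dropna().unique().tolist()

# Obviously wrong values: prices that are zero or negative.
n_bad_prices = int((toy["prod_price"] <= 0).sum())

print(n_duplicates, int(n_missing["cust_state"]), state_spellings, n_bad_prices)
```

On the real data the same calls would be made on `df_cleaned`; which thresholds count as "obviously wrong" is a judgment call.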

> **NOTE:** You can check whether your answer is at least close to the expected answer with the check functions (`check_q1()`, `check_q2()`, ...). These functions check your answer and give you some feedback. However, your answer might be _incorrect_ even if a check function says you're "close" to the expected answer.

In [1]:
import pandas as pd
import numpy as np


from checker.binder import binder; binder.bind(globals())
from intro_data_analytics.check_practice_explore import *

In [2]:
df_cleaned = pd.read_csv('data/inu_neko_orderline_cleaned.csv', parse_dates=["trans_timestamp"])
df_cleaned

Unnamed: 0,trans_id,prod_upc,cust_id,trans_timestamp,trans_year,trans_month,trans_day,trans_hour,trans_quantity,cust_age,cust_state,prod_price,prod_title,prod_category,prod_animal_type,total_sales
0,10311803,704772572943,1010865,2021-05-04 14:20:00.426577,2021,5,4,4,1,21,Pennsylvania,35.98,Scratchy Post,toy,cat,35.98
1,10300426,441530839394,1001150,2021-01-19 10:03:15.598881,2021,1,19,19,2,21,Oregon,28.45,Ball and String,toy,cat,56.90
2,10311471,969568933713,1010603,2021-05-02 09:41:06.068295,2021,5,2,2,1,44,Texas,32.99,Foozy Mouse,toy,cat,32.99
3,10306506,100469015054,1006616,2021-04-03 08:22:05.599849,2021,4,3,3,2,25,New Jersey,18.95,Tuna Tasties,treat,cat,37.90
4,10308646,521244155990,1008375,2021-04-17 09:44:51.009960,2021,4,17,17,1,30,New Jersey,54.95,Reddy Beddy,bedding,dog,54.95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23751,10305783,832878954342,1006011,2021-03-29 10:48:52.680900,2021,3,29,29,1,40,New York,45.99,Snoozer Hammock,bedding,cat,45.99
23752,10317151,441530839394,1014854,2021-05-27 12:15:18.931957,2021,5,27,27,1,21,New Mexico,28.45,Ball and String,toy,cat,28.45
23753,10302720,733426809698,1003380,2021-02-28 10:13:34.969109,2021,2,28,28,1,26,Michigan,18.95,Yum Fish-Dish,food,cat,18.95
23754,10314978,733426809698,1013297,2021-05-18 10:51:44.007914,2021,5,18,18,1,29,Pennsylvania,18.95,Yum Fish-Dish,food,cat,18.95


In [3]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23756 entries, 0 to 23755
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   trans_id          23756 non-null  int64         
 1   prod_upc          23756 non-null  int64         
 2   cust_id           23756 non-null  int64         
 3   trans_timestamp   23756 non-null  datetime64[ns]
 4   trans_year        23756 non-null  int64         
 5   trans_month       23756 non-null  int64         
 6   trans_day         23756 non-null  int64         
 7   trans_hour        23756 non-null  int64         
 8   trans_quantity    23756 non-null  int64         
 9   cust_age          23756 non-null  int64         
 10  cust_state        23756 non-null  object        
 11  prod_price        23756 non-null  float64       
 12  prod_title        23756 non-null  object        
 13  prod_category     23756 non-null  object        
 14  prod_animal_type  2375

#### Question 1: Number of Orders

How many transactions are there?

In [4]:
df_cleaned.trans_id.nunique()

17416

In [5]:
# your code here

num_trans = df_cleaned.trans_id.nunique()

In [6]:
# Q1 Test Cases
check_q1()


Your answer `17416` for the `num_trans` variable looks about right!
Note that doesn't mean it's correct though, just that your answer is at least **close** to the correct answer. It's possible your answer isn't correct, although it's close!
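
The transaction count (17416) is smaller than the row count (23756) because one transaction can span several order lines. The sketch below shows this on made-up IDs; the `toy` frame is hypothetical. It also shows that `nunique()` gives the same number as `value_counts().count()`.

```python
import pandas as pd

# Hypothetical order lines: transaction 100 has two lines, 102 has three.
toy = pd.DataFrame({"trans_id": [100, 100, 101, 102, 102, 102]})

n_rows = len(toy)                                    # order lines
n_trans_long = toy.trans_id.value_counts().count()   # distinct transactions
n_trans_short = toy.trans_id.nunique()               # same count, shorter call

print(n_rows, n_trans_long, n_trans_short)
```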


#### Question 2: Alpha and Omega I
What was the month and day of the first sale? Store as a tuple in that order and assign the tuple to the variable `first_date`.

In [7]:
df_cleaned.loc[0]

trans_id                              10311803
prod_upc                          704772572943
cust_id                                1010865
trans_timestamp     2021-05-04 14:20:00.426577
trans_year                                2021
trans_month                                  5
trans_day                                    4
trans_hour                                   4
trans_quantity                               1
cust_age                                    21
cust_state                        Pennsylvania
prod_price                               35.98
prod_title                       Scratchy Post
prod_category                              toy
prod_animal_type                           cat
total_sales                              35.98
Name: 0, dtype: object

In [8]:
# your code here

first_date = (df_cleaned.trans_month.loc[0], df_cleaned.trans_day.loc[0])

In [9]:
first_date

(5, 4)

In [10]:
# Q2 Test Cases
check_q2()


Your answer `(5, 4)` for `first_date` variable looks about right!
Note that doesn't mean it's correct though, just that your answer is at least **close** to the correct answer. It's possible your answer isn't correct, although it's close!
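
Row 0 is the first sale only if the file is sorted by time, and the preview of `df_cleaned` shows a January timestamp in row 1, after the May timestamp in row 0. A more cautious approach, assuming "first sale" means the earliest `trans_timestamp`, is sketched below on a made-up frame. The `toy` data and the `*_toy` names are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps in unsorted order, like the real file.
toy = pd.DataFrame({
    "trans_timestamp": pd.to_datetime(
        ["2021-05-04 14:20:00", "2021-01-19 10:03:15", "2021-05-25 11:54:05"]
    ),
})

# Earliest and latest timestamps, independent of row order.
first_ts = toy["trans_timestamp"].min()
last_ts = toy["trans_timestamp"].max()

# (month, day) tuples in the order the questions ask for.
first_date_toy = (first_ts.month, first_ts.day)
last_date_toy = (last_ts.month, last_ts.day)

print(first_date_toy, last_date_toy)
```

The same `min()` / `max()` pattern would also apply to the last sale in Question 3.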


#### Question 3: Alpha and Omega II
What was the month and day of the last sale? Store as a tuple in that order and assign the tuple to the variable `last_date`.

In [11]:
df_cleaned.loc[23755]

trans_id                              10316727
prod_upc                          469757173540
cust_id                                1014548
trans_timestamp     2021-05-25 11:54:05.656762
trans_year                                2021
trans_month                                  5
trans_day                                   25
trans_hour                                  25
trans_quantity                               4
cust_age                                    31
cust_state                              Nevada
prod_price                               35.99
prod_title                       Kitty Climber
prod_category                              toy
prod_animal_type                           cat
total_sales                             143.96
Name: 23755, dtype: object

In [12]:
# your code here

last_date = (df_cleaned.trans_month.loc[23755], df_cleaned.trans_day.loc[23755])

In [13]:
# Q3 Test Cases
check_q3()


Your answer `(5, 25)` for `last_date` variable looks about right!
Note that doesn't mean it's correct though, just that your answer is at least **close** to the correct answer. It's possible your answer isn't correct, although it's close!


#### Question 4: Cats vs Dogs

Which animal product type is most popular?

In [14]:
df_cleaned.prod_animal_type.value_counts()

cat    13660
dog    10096
Name: prod_animal_type, dtype: int64

In [15]:
df2 = pd.DataFrame(df_cleaned.prod_animal_type.value_counts())
df2

Unnamed: 0,prod_animal_type
cat,13660
dog,10096


In [16]:
# your code here

most_pop = df2.index[0]

In [17]:
most_pop

'cat'

In [18]:
# Q4 Test Cases
check_q4()


Your answer `cat` for the `most_pop` variable looks about right!
Note that doesn't mean it's correct though, just that your answer is at least **close** to the correct answer. It's possible your answer isn't correct, although it's close!
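
"Most popular" can be read as most order lines or as most items sold, and the question does not say which. The sketch below shows both readings on made-up data, using `idxmax()` so the answer does not depend on the sort order of `value_counts()`. The `toy` frame is hypothetical and was chosen so the two readings disagree.

```python
import pandas as pd

# Hypothetical order lines: cats appear on more lines, dogs sell more items.
toy = pd.DataFrame({
    "prod_animal_type": ["cat", "dog", "cat", "cat", "dog"],
    "trans_quantity": [1, 5, 1, 1, 4],
})

# Reading 1: animal type with the most order lines.
by_rows = toy["prod_animal_type"].value_counts().idxmax()

# Reading 2: animal type with the most items sold.
by_items = toy.groupby("prod_animal_type")["trans_quantity"].sum().idxmax()

print(by_rows, by_items)
```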


#### Question 5: More Money More Problems I

What was the total dollar amount made in the month of January? Store this in the variable `jan_rev`.

In [19]:
df3 = df_cleaned[df_cleaned.trans_month == 1]
df3

Unnamed: 0,trans_id,prod_upc,cust_id,trans_timestamp,trans_year,trans_month,trans_day,trans_hour,trans_quantity,cust_age,cust_state,prod_price,prod_title,prod_category,prod_animal_type,total_sales
1,10300426,441530839394,1001150,2021-01-19 10:03:15.598881,2021,1,19,19,2,21,Oregon,28.45,Ball and String,toy,cat,56.90
34,10300269,344538897332,1001179,2021-01-13 08:25:40.635465,2021,1,13,13,2,20,North Carolina,19.99,Feline Fix Mix,treat,cat,39.98
36,10300213,845773115334,1001128,2021-01-08 10:30:08.064197,2021,1,8,8,1,31,New York,12.99,Purr Mix,food,cat,12.99
37,10300803,845773115334,1001665,2021-01-28 06:23:29.537982,2021,1,28,28,1,29,Idaho,12.99,Purr Mix,food,cat,12.99
79,10300720,969568933713,1001521,2021-01-26 11:41:53.077390,2021,1,26,26,4,40,New Jersey,32.99,Foozy Mouse,toy,cat,131.96
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23646,10300866,904582148679,1001723,2021-01-29 10:09:43.347188,2021,1,29,29,1,40,New Jersey,12.97,Whole Chemistry Recipe,food,dog,12.97
23661,10300840,140160459467,1001698,2021-01-29 14:19:43.033245,2021,1,29,29,3,24,Connecticut,48.95,Snoozer Essentails,bedding,dog,146.85
23675,10300387,344934101144,1001289,2021-01-17 08:26:37.366229,2021,1,17,17,1,26,California,24.95,Fetch Blaster,toy,dog,24.95
23738,10300717,344934101144,1001592,2021-01-26 10:05:05.511948,2021,1,26,26,1,34,New Mexico,24.95,Fetch Blaster,toy,dog,24.95


In [20]:
# your code here

jan_rev = df3["total_sales"].sum()

jan_rev

51739.73999999999

In [21]:
# Q5 Test Cases
check_q5()


Your answer `51739.73999999999` for the `jan_rev` variable looks about right!
Note that doesn't mean it's correct though, just that your answer is at least **close** to the correct answer. It's possible your answer isn't correct, although it's close!


#### Question 6: More Money More Problems II

What was the total dollar amount made in the month of May? Store this in the variable `may_rev`.

In [22]:
df4 = df_cleaned[df_cleaned.trans_month == 5]
df4

Unnamed: 0,trans_id,prod_upc,cust_id,trans_timestamp,trans_year,trans_month,trans_day,trans_hour,trans_quantity,cust_age,cust_state,prod_price,prod_title,prod_category,prod_animal_type,total_sales
0,10311803,704772572943,1010865,2021-05-04 14:20:00.426577,2021,5,4,4,1,21,Pennsylvania,35.98,Scratchy Post,toy,cat,35.98
2,10311471,969568933713,1010603,2021-05-02 09:41:06.068295,2021,5,2,2,1,44,Texas,32.99,Foozy Mouse,toy,cat,32.99
7,10315454,483326155497,1013644,2021-05-20 12:13:35.082788,2021,5,20,20,2,42,California,10.99,The New Bone,food,dog,21.98
12,10317084,733426809698,1014811,2021-05-27 09:19:44.655972,2021,5,27,27,1,25,New Jersey,18.95,Yum Fish-Dish,food,cat,18.95
14,10315210,733426809698,1013468,2021-05-19 10:05:29.322772,2021,5,19,19,1,25,New York,18.95,Yum Fish-Dish,food,cat,18.95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23748,10314147,733426809698,1012675,2021-05-15 12:06:05.808038,2021,5,15,15,1,21,Ohio,18.95,Yum Fish-Dish,food,cat,18.95
23750,10315238,704772572943,1013490,2021-05-19 09:13:51.713197,2021,5,19,19,1,54,California,35.98,Scratchy Post,toy,cat,35.98
23752,10317151,441530839394,1014854,2021-05-27 12:15:18.931957,2021,5,27,27,1,21,New Mexico,28.45,Ball and String,toy,cat,28.45
23754,10314978,733426809698,1013297,2021-05-18 10:51:44.007914,2021,5,18,18,1,29,Pennsylvania,18.95,Yum Fish-Dish,food,cat,18.95


In [23]:
# your code here

may_rev = df4["total_sales"].sum()

may_rev

370507.59

In [24]:
# Q6 Test Cases
check_q6()


Your answer `370507.59` for the `may_rev` variable looks about right!
Note that doesn't mean it's correct though, just that your answer is at least **close** to the correct answer. It's possible your answer isn't correct, although it's close!
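
Questions 5 and 6 filter the frame once per month. An alternative, sketched here on made-up numbers, is to group by `trans_month` once and read each month from the result. Rounding to cents also hides floating-point residue such as `51739.73999999999`. The `toy` frame and the `*_toy` names are hypothetical.

```python
import pandas as pd

# Hypothetical sales lines for two months.
toy = pd.DataFrame({
    "trans_month": [1, 1, 5, 5, 5],
    "total_sales": [10.0, 20.5, 5.0, 7.5, 2.5],
})

# Revenue for every month in one pass.
rev_by_month = toy.groupby("trans_month")["total_sales"].sum()

# Look up individual months and round to cents.
jan_rev_toy = round(rev_by_month.loc[1], 2)
may_rev_toy = round(rev_by_month.loc[5], 2)

print(jan_rev_toy, may_rev_toy)
```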


#### Question 7: Transaction Size

What is the average number of items bought in each transaction? Store this in the variable `avg_num_items`.

In [25]:
num_trans

17416

In [26]:
df_cleaned.trans_quantity.sum()

32777

In [27]:
# your code here

avg_num_items = df_cleaned.trans_quantity.sum() / num_trans

avg_num_items

1.8820050528249885

In [28]:
# Q7 Test Cases
check_q7()


Your answer `1.8820050528249885` for the `avg_num_items` variable looks about right!
Note that doesn't mean it's correct though, just that your answer is at least **close** to the correct answer. It's possible your answer isn't correct, although it's close!
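
Dividing the total quantity by the number of distinct transactions is the same as summing the quantity within each transaction and then averaging those sums. The sketch below checks that on made-up data; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical order lines: transaction 100 has two lines.
toy = pd.DataFrame({
    "trans_id": [100, 100, 101, 102],
    "trans_quantity": [1, 2, 4, 1],
})

# Route A: items per transaction, then the mean of those totals.
items_per_trans = toy.groupby("trans_id")["trans_quantity"].sum()
avg_a = items_per_trans.mean()

# Route B: total items divided by the number of distinct transactions.
avg_b = toy["trans_quantity"].sum() / toy["trans_id"].nunique()

print(avg_a, avg_b)
```

Route A also leaves `items_per_trans` available for looking at the distribution, not only the mean.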


#### Question 8: Best Products I

What are the top ten product titles by the total number of items sold for that product? Display in descending order. Store in variable `top_num_sales`.

In [29]:
df_cleaned.groupby(["prod_title"])["trans_quantity"].sum().nlargest(10)

prod_title
Reddy Beddy               4220
Yum Fish-Dish             2613
Feline Fix Mix            2116
Kitty Climber             2063
Tuna Tasties              1926
Chewie Dental             1550
Cat Cave                  1549
Purrfect Puree            1528
Whole Chemistry Recipe    1502
Snoozer Hammock           1431
Name: trans_quantity, dtype: int64

In [30]:
df5 = df_cleaned.groupby(["prod_title"])["trans_quantity"].sum().nlargest(10)

df5

prod_title
Reddy Beddy               4220
Yum Fish-Dish             2613
Feline Fix Mix            2116
Kitty Climber             2063
Tuna Tasties              1926
Chewie Dental             1550
Cat Cave                  1549
Purrfect Puree            1528
Whole Chemistry Recipe    1502
Snoozer Hammock           1431
Name: trans_quantity, dtype: int64

In [31]:
df5.index

Index(['Reddy Beddy', 'Yum Fish-Dish', 'Feline Fix Mix', 'Kitty Climber',
       'Tuna Tasties', 'Chewie Dental', 'Cat Cave', 'Purrfect Puree',
       'Whole Chemistry Recipe', 'Snoozer Hammock'],
      dtype='object', name='prod_title')

In [32]:
# your code here

top_num_sales = df5.index

top_num_sales

Index(['Reddy Beddy', 'Yum Fish-Dish', 'Feline Fix Mix', 'Kitty Climber',
       'Tuna Tasties', 'Chewie Dental', 'Cat Cave', 'Purrfect Puree',
       'Whole Chemistry Recipe', 'Snoozer Hammock'],
      dtype='object', name='prod_title')

In [33]:
# Q8 Test Cases
check_q8()


Your answer for the `top_num_sales` variable looks about right!
Note that doesn't mean it's correct though, just that your answer is at least **close** to the correct answer. It's possible your answer isn't correct, although it's close!


#### Question 9: Best Products II

What are the top ten product titles by total dollar amount made? Display in descending order. Store in variable `top_tot_sales`.

In [34]:
df_cleaned.groupby(["prod_title"])["total_sales"].sum().nlargest(10)

prod_title
Reddy Beddy           261218.04
Cat Cave              113061.51
Kitty Climber          74247.37
Snoozer Hammock        65811.69
Snoozer Essentails     61236.45
Yum Fish-Dish          49516.35
Feline Fix Mix         42298.84
Scratchy Post          40873.28
Foozy Mouse            38763.25
Tuna Tasties           36497.70
Name: total_sales, dtype: float64

In [35]:
df6 = df_cleaned.groupby(["prod_title"])["total_sales"].sum().nlargest(10)
df6

prod_title
Reddy Beddy           261218.04
Cat Cave              113061.51
Kitty Climber          74247.37
Snoozer Hammock        65811.69
Snoozer Essentails     61236.45
Yum Fish-Dish          49516.35
Feline Fix Mix         42298.84
Scratchy Post          40873.28
Foozy Mouse            38763.25
Tuna Tasties           36497.70
Name: total_sales, dtype: float64

In [36]:
# your code here

top_tot_sales = df6.index
top_tot_sales

Index(['Reddy Beddy', 'Cat Cave', 'Kitty Climber', 'Snoozer Hammock',
       'Snoozer Essentails', 'Yum Fish-Dish', 'Feline Fix Mix',
       'Scratchy Post', 'Foozy Mouse', 'Tuna Tasties'],
      dtype='object', name='prod_title')

In [37]:
# Q9 Test Cases
check_q9()


Your answer for the `top_tot_sales` variable looks about right!
Note that doesn't mean it's correct though, just that your answer is at least **close** to the correct answer. It's possible your answer isn't correct, although it's close!
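
Questions 8 and 9 use the same group, sum, `nlargest` pattern on different columns, and the two rankings need not agree. Here is a minimal sketch on made-up products. The titles and numbers are hypothetical, and `.tolist()` is only one way to turn the resulting index into a plain list.

```python
import pandas as pd

# Hypothetical products: "Treat" sells the most units, "Bed" earns the most.
toy = pd.DataFrame({
    "prod_title": ["Bed", "Bed", "Toy", "Treat", "Toy", "Treat"],
    "trans_quantity": [1, 1, 3, 5, 1, 2],
    "total_sales": [50.0, 50.0, 15.0, 10.0, 5.0, 4.0],
})

# Top two titles by number of items sold, in descending order.
top_by_items = (
    toy.groupby("prod_title")["trans_quantity"].sum().nlargest(2).index.tolist()
)

# Top two titles by revenue, in descending order.
top_by_sales = (
    toy.groupby("prod_title")["total_sales"].sum().nlargest(2).index.tolist()
)

print(top_by_items, top_by_sales)
```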


#### Question 10: Bonus

What is the proportion of returning customers? Store as variable `prop_returning`.

In [38]:
df_cleaned.cust_id.value_counts()

1005241    12
1007511    12
1001480    10
1011390     9
1001168     9
           ..
1012375     1
1014422     1
1010324     1
1004179     1
1001481     1
Name: cust_id, Length: 14096, dtype: int64

In [39]:
df_cleaned.cust_id.value_counts() > 1

1005241     True
1007511     True
1001480     True
1011390     True
1001168     True
           ...  
1012375    False
1014422    False
1010324    False
1004179    False
1001481    False
Name: cust_id, Length: 14096, dtype: bool

In [40]:
(df_cleaned.cust_id.value_counts() > 1).sum()

5804

In [41]:
df_cleaned.cust_id.nunique()

14096

In [42]:
df_cleaned.groupby(["cust_id"])["total_sales"].count()

cust_id
1001012    1
1001013    1
1001014    3
1001015    4
1001016    1
          ..
1015150    1
1015151    1
1015152    1
1015153    2
1015154    3
Name: total_sales, Length: 14096, dtype: int64

In [43]:
df7 = pd.DataFrame(df_cleaned.groupby(["cust_id"])["total_sales"].count())

df7

cust_id,total_sales
1001012,1
1001013,1
1001014,3
1001015,4
1001016,1
...,...
1015150,1
1015151,1
1015152,1
1015153,2


In [44]:
df8 = df7[df7.total_sales != 1]

df8

cust_id,total_sales
1001014,3
1001015,4
1001017,2
1001019,2
1001021,5
...,...
1015139,2
1015140,2
1015146,2
1015153,2


In [45]:
# your code here

prop_returning = round((5804/14096) * 100, 2)

prop_returning

41.17

In [46]:
# Q10 Test Cases
check_q10()


Your answer `41.17` for the `prop_returning` isn't quite right.
You might want to check the order of your answer.
Take a closer look at your code to see what you can change.
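
Two possible reasons for this feedback are worth testing, and both are assumptions, since the checker's expected value is not shown. First, a proportion is usually a fraction between 0 and 1, not a percentage. Second, counting rows per customer counts order lines, so a customer with a single two-line transaction looks like a returning customer. The sketch below shows both on made-up data; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical order lines: customer 1 has one transaction with two lines,
# customer 3 has two separate transactions.
toy = pd.DataFrame({
    "cust_id": [1, 1, 2, 3, 3, 3, 4],
    "trans_id": [10, 10, 11, 12, 13, 13, 14],
})

# Share of customers with more than one order line (fraction, not percent).
rows_per_customer = toy["cust_id"].value_counts()
prop_by_rows = (rows_per_customer > 1).mean()

# Share of customers with more than one distinct transaction.
trans_per_customer = toy.groupby("cust_id")["trans_id"].nunique()
prop_by_trans = (trans_per_customer > 1).mean()

print(prop_by_rows, prop_by_trans)
```

Taking the mean of a boolean Series gives the fraction of `True` values, which avoids hard-coding counts such as `5804/14096`.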
