# Part III: Explore

![alt text](data/inu_neko_logo_small.png "Inu + Neko")

Hello! We are Inu + Neko, a dog and cat services and supplies store located in New York City. We just started our e-commerce business and need your help analyzing our data!

## Description

Before starting the analysis, make sure the data is clean. As a reminder, check for:

- Duplicate records
- Consistent formatting
- Missing values
- Obviously wrong values
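
The four checks above can be sketched on a small toy table. The data below is invented for illustration and is not taken from the store's file. The title-casing rule used for the formatting check is also an assumption and would need adapting to real product names.

```python
import pandas as pd

# Hypothetical toy order lines, invented for illustration only.
toy = pd.DataFrame({
    "trans_id": [1, 1, 2, 3],
    "prod_title": ["Cat Cave", "Cat Cave", "cat cave ", None],
    "trans_quantity": [1, 1, 2, -1],
    "prod_price": [72.99, 72.99, 72.99, 18.95],
})

# Duplicate records: rows identical to an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())

# Missing values: number of nulls per column.
n_missing = toy.isna().sum()

# Consistent formatting: titles that change after trimming and title-casing
# (an assumed convention for this toy data).
titles = toy["prod_title"].dropna()
n_inconsistent = int((titles != titles.str.strip().str.title()).sum())

# Obviously wrong values: non-positive quantities or prices.
n_wrong = int(((toy["trans_quantity"] <= 0) | (toy["prod_price"] <= 0)).sum())

print(n_duplicates, n_missing["prod_title"], n_inconsistent, n_wrong)
```

Each count is the number of offending rows (or nulls), so a clean table should report zero for all four.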

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df_cleaned = pd.read_csv('data/inu_neko_orderline_clean.csv')
df_cleaned

Unnamed: 0,trans_id,prod_upc,cust_id,trans_timestamp,trans_year,trans_month,trans_day,trans_hour,trans_quantity,cust_age,cust_state,prod_price,prod_title,prod_category,prod_animal_type,total_sales
0,10300097,719638485153,1001019,2021-01-01 07:35:21.439873,2021,1,1,1,1,20,New York,72.99,Cat Cave,bedding,cat,72.99
1,10300093,73201504044,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,1,34,New York,18.95,Purrfect Puree,treat,cat,18.95
2,10300093,719638485153,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,1,34,New York,72.99,Cat Cave,bedding,cat,72.99
3,10300093,441530839394,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,2,34,New York,28.45,Ball and String,toy,cat,56.90
4,10300093,733426809698,1001015,2021-01-01 09:33:37.499660,2021,1,1,1,1,34,New York,18.95,Yum Fish-Dish,food,cat,18.95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38218,10327860,287663658863,1022098,2021-06-30 15:37:12.821020,2021,6,30,30,1,25,New York,9.95,All Veggie Yummies,treat,dog,9.95
38219,10327960,140160459467,1022157,2021-06-30 15:45:09.872732,2021,6,30,30,2,31,Pennsylvania,48.95,Snoozer Essentails,bedding,dog,97.90
38220,10328009,425361189561,1022189,2021-06-30 15:57:44.295104,2021,6,30,30,2,53,New Jersey,15.99,Snack-em Fish,treat,cat,31.98
38221,10328089,733426809698,1022236,2021-06-30 15:59:29.801593,2021,6,30,30,1,23,Tennessee,18.95,Yum Fish-Dish,food,cat,18.95


In [3]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38223 entries, 0 to 38222
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   trans_id          38223 non-null  int64  
 1   prod_upc          38223 non-null  int64  
 2   cust_id           38223 non-null  int64  
 3   trans_timestamp   38223 non-null  object 
 4   trans_year        38223 non-null  int64  
 5   trans_month       38223 non-null  int64  
 6   trans_day         38223 non-null  int64  
 7   trans_hour        38223 non-null  int64  
 8   trans_quantity    38223 non-null  int64  
 9   cust_age          38223 non-null  int64  
 10  cust_state        38223 non-null  object 
 11  prod_price        38223 non-null  float64
 12  prod_title        38223 non-null  object 
 13  prod_category     38223 non-null  object 
 14  prod_animal_type  38223 non-null  object 
 15  total_sales       38223 non-null  float64
dtypes: float64(2), int64(9), object(5)
memor

#### Question 1: Number of Orders

How many transactions are there?

In [4]:
# your code here

# Count distinct transactions; one transaction can span several rows
num_trans = df_cleaned['trans_id'].nunique()

In [5]:
# Q1 Test Cases
assert 39000 > num_trans > 25000
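
In the preview of `df_cleaned`, `trans_id` 10300093 appears on several rows, one per product. Counting rows and counting transactions can therefore give different answers. A minimal sketch of the difference, using a hypothetical three-row table:

```python
import pandas as pd

# Hypothetical order lines: transaction 101 contains two products.
lines = pd.DataFrame({
    "trans_id": [101, 101, 102],
    "prod_title": ["Cat Cave", "Ball and String", "Yum Fish-Dish"],
})

n_order_lines = len(lines)                     # rows in the table
n_transactions = lines["trans_id"].nunique()   # distinct transactions

print(n_order_lines, n_transactions)
```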

#### Question 2: Alpha and Omega I
What was the month and day of the first sale? Store as a tuple in that order and assign the tuple to the variable `first_date`.

In [6]:
# your code here
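# Locate the earliest timestamp instead of assuming row 0 is the first sale
first = df_cleaned.loc[pd.to_datetime(df_cleaned["trans_timestamp"]).idxmin()]
first_date = (int(first["trans_month"]), int(first["trans_day"]))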
first_date

(1, 1)

In [7]:
# Q2 Test Cases
assert len(first_date) == 2
assert str(first_date[0]).isnumeric() 
assert str(first_date[1]).isnumeric() 

#### Question 3: Alpha and Omega II
What was the month and day of the last sale? Store as a tuple in that order and assign the tuple to the variable `last_date`.

In [8]:
# your code here
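# Locate the latest timestamp instead of hard-coding the last row label
last = df_cleaned.loc[pd.to_datetime(df_cleaned["trans_timestamp"]).idxmax()]
last_date = (int(last["trans_month"]), int(last["trans_day"]))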
last_date

(6, 30)

In [9]:
# Q3 Test Cases
assert len(last_date) == 2
assert str(last_date[0]).isnumeric() 
assert str(last_date[1]).isnumeric() 
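
The `info()` output lists `trans_timestamp` as an `object` (string) column, so the first and last sale can also be found by parsing it with `pd.to_datetime` and taking the minimum and maximum. This is a sketch on hypothetical, deliberately unsorted timestamps, not the store's data:

```python
import pandas as pd

# Hypothetical timestamps, intentionally out of chronological order.
toy = pd.DataFrame({
    "trans_timestamp": [
        "2021-03-02 10:00:00.000000",
        "2021-01-05 08:30:00.000000",
        "2021-06-20 17:45:00.000000",
    ],
})

# Parse the strings into datetimes, then take the earliest and latest.
ts = pd.to_datetime(toy["trans_timestamp"])
first, last = ts.min(), ts.max()

first_date = (first.month, first.day)
last_date = (last.month, last.day)
print(first_date, last_date)
```

Working from the parsed timestamps gives the same answer whether or not the rows happen to be sorted.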

#### Question 4: Cats vs Dogs

Which animal product type is most popular?

In [10]:
# your code here
# Count order lines per animal type and take the most frequent label
most_pop = df_cleaned["prod_animal_type"].value_counts().idxmax()

In [11]:
# Q4 Test Cases
assert type(most_pop) == str
assert most_pop in ['cat', 'dog']

#### Question 5: More Money More Problems I

What was the total dollar amount made in the month of January? Store this in the variable `jan_rev`.

In [12]:
# your code here
sales_jan = df_cleaned[df_cleaned["trans_month"] == 1]
jan_rev = sales_jan["total_sales"].sum()

In [13]:
# Q5 Test Cases
assert  40000 <= jan_rev <= 60000

#### Question 6: More Money More Problems II

What was the total dollar amount made in the month of June? Store this in the variable `june_rev`.

In [14]:
# your code here
sales_june = df_cleaned[df_cleaned["trans_month"] == 6]
june_rev = sales_june["total_sales"].sum()

In [15]:
# Q6 Test Cases
assert  500000 <= june_rev <= 600000
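
Per-month revenue figures such as `jan_rev` and `june_rev` can also be read from a single grouped sum rather than one filter per month. A small sketch with made-up rows (the prices are borrowed from the preview, but the table itself is hypothetical):

```python
import pandas as pd

# Hypothetical order lines for two months.
toy = pd.DataFrame({
    "trans_month": [1, 1, 6, 6, 6],
    "total_sales": [72.99, 18.95, 9.95, 97.90, 31.98],
})

# Total revenue per month, indexed by month number.
rev_by_month = toy.groupby("trans_month")["total_sales"].sum()

jan_rev = rev_by_month.loc[1]
june_rev = rev_by_month.loc[6]
print(rev_by_month)
```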

#### Question 7: Transaction Size

What is the average number of items bought in each transaction? Store this in the variable `avg_num_items`.

In [16]:
# your code here

# Total items sold divided by the number of distinct transactions
avg_num_items = df_cleaned['trans_quantity'].sum() / df_cleaned['trans_id'].nunique()

In [17]:
# Q7 Test Cases
assert  0 <= avg_num_items <= 3

#### Question 8: Best Products I

What are the top ten product titles by the number of sales? Display in descending order. Store in variable `top_num_sales`.

In [18]:
# your code here

top_num_sales = df_cleaned['prod_title'].value_counts().head(10)

In [19]:
# Q8 Test Cases
assert len(top_num_sales) == 10
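
"Number of sales" can be read in two ways: as the number of order lines a product appears on, or as the number of units sold. The sketch below uses a hypothetical toy table to show that the two readings can rank products differently. Which one the question intends is an assumption either way.

```python
import pandas as pd

# Hypothetical order lines: one product sells often, the other in bulk.
toy = pd.DataFrame({
    "prod_title": ["Cat Cave", "Cat Cave", "Ball and String"],
    "trans_quantity": [1, 1, 5],
})

# Ranking by number of order lines (what value_counts measures).
by_lines = toy["prod_title"].value_counts()

# Ranking by units sold (sum of quantities per title).
by_units = (
    toy.groupby("prod_title")["trans_quantity"]
    .sum()
    .sort_values(ascending=False)
)

print(by_lines.index[0], by_units.index[0])
```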

#### Question 9: Best Products II

What are the top ten product titles by total dollar amount made? Display in descending order. Store in variable `top_tot_sales`.

In [20]:
# your code here

# Sum revenue per title; use a new name so Q8's top_num_sales is not overwritten
sales_by_title = df_cleaned.groupby('prod_title')['total_sales'].sum().reset_index()
top_tot_sales = sales_by_title.sort_values("total_sales", ascending=False).head(10)
top_tot_sales = top_tot_sales["prod_title"]
top_tot_sales

11           Reddy Beddy
2               Cat Cave
8          Kitty Climber
15       Snoozer Hammock
14    Snoozer Essentails
20         Yum Fish-Dish
12         Scratchy Post
5         Feline Fix Mix
7            Foozy Mouse
18          Tuna Tasties
Name: prod_title, dtype: object

In [21]:
# Q9 Test Cases
assert len(top_tot_sales) == 10

#### Question 10: Bonus

What is the proportion of returning customers? Store as variable `prop_returning`.

In [22]:
# your code here
# A returning customer has more than one distinct transaction;
# counting rows would also flag single orders with several products
trans_per_cust = df_cleaned.groupby("cust_id")["trans_id"].nunique()
prop_returning = (trans_per_cust > 1).mean()

In [23]:
# Q10 Test Cases
assert 0 <= prop_returning <= 1
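
One possible definition of a returning customer is one with more than one distinct `trans_id`. This is an assumption about the intended meaning, since the assignment text does not spell it out. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical order lines: customer 1 bought two products in one
# transaction, customer 2 came back for a second transaction.
toy = pd.DataFrame({
    "cust_id": [1, 1, 2, 2, 3],
    "trans_id": [10, 10, 20, 21, 30],
})

# Distinct transactions per customer.
trans_per_cust = toy.groupby("cust_id")["trans_id"].nunique()

# Share of customers with more than one transaction.
prop_returning = float((trans_per_cust > 1).mean())
print(prop_returning)
```

Under this definition only customer 2 counts as returning. A plain row count per customer would also include customer 1.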