# Table of Contents 

This notebook contains the following: 

* Importing libraries 
* Importing data sets 
* Identifying and changing variable type in df_ords dataframe to a suitable format 
* Identifying the busiest hours of the day 
* Using a data dictionary to find the meaning behind the department_id column
* Creating a subset for breakfast item sales
* Creating different subsets from dataframes to present to client 
* Counting total rows of dataframe 
* Identifying inaccuracies 
* Exporting dataframes 


# Step 1 - Wrangling Procedures from Exercise

# 01. Importing libraries

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import os 

In [None]:
#Turn project folder path into a string
'/Users/aysha/Documents/Instacart Basket Analysis/'

In [None]:
path = r'/Users/aysha/Documents/Instacart Basket Analysis/'

In [None]:
path

In [None]:
# Import the “orders.csv” file into Jupyter as df_ords
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'), index_col = False)

In [None]:
df_ords.head()

In [None]:
# Preview df_ords without the eval_set column (drop returns a new dataframe; df_ords itself is unchanged)
df_ords.drop(columns = ['eval_set'])

In [None]:
# Overwriting df_ords to update it without eval_set
df_ords = df_ords.drop(columns = ['eval_set'])

In [None]:
df_ords

In [None]:
# Check for missing values in column 'days_since_prior_order'
df_ords['days_since_prior_order'].value_counts(dropna = False)
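
As a complement to the frequency check above, the number of missing values can also be counted directly. The following is a minimal sketch on a made-up dataframe (`toy_ords` and its values are assumptions for illustration, not the real orders data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords; the values are invented for illustration
toy_ords = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'days_since_prior_order': [np.nan, 7.0, 7.0, 30.0],
})

# value_counts(dropna = False) lists NaN as its own category
counts = toy_ords['days_since_prior_order'].value_counts(dropna = False)

# isnull().sum() returns the number of missing values as a single figure
n_missing = toy_ords['days_since_prior_order'].isnull().sum()

print(counts)
print(n_missing)
```

`value_counts` is handy for seeing the whole distribution, while `isnull().sum()` answers only the question of how many values are missing.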

In [None]:
# Renaming column order_dow to orders_day_of_week
df_ords.rename(columns = {'order_dow' : 'orders_day_of_week'}, inplace = True)

In [None]:
# Check first 5 rows of df_ords
df_ords.head()

In [None]:
df_ords.describe()

In [None]:
# Converting data type of order_id to string
df_ords['order_id'] = df_ords['order_id'].astype('str')

In [None]:
df_ords['order_id'].dtype
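
One practical effect of converting an identifier to a string is that `describe()` no longer treats it as a number. A small sketch on a made-up dataframe (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy dataframe with a numeric identifier and a genuine numeric variable
toy = pd.DataFrame({'order_id': [101, 102, 103], 'order_number': [1, 2, 3]})

# Before conversion, describe() summarises both columns
before = list(toy.describe().columns)

# After conversion, the identifier is excluded from the numeric summary
toy['order_id'] = toy['order_id'].astype('str')
after = list(toy.describe().columns)

print(before)
print(after)
```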

In [None]:
# Importing departments.csv data set using os library
df_dep = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'departments.csv'), index_col = False)

In [None]:
#Print the first 5 rows
df_dep.head()

In [None]:
# Transposing df_dep to change it from wide format to long format (preview only)
df_dep.T

In [None]:
# Creating a new dataframe from the transposed version
df_dep_t = df_dep.T

In [None]:
df_dep_t

In [None]:
# Preview the transposed dataframe with a default integer index (the result is not assigned, so df_dep_t is unchanged)
df_dep_t.reset_index()

In [None]:
# Take the first row of df_dep_t for the header 
new_header = df_dep_t.iloc[0]

In [None]:
new_header

In [None]:
# Create new dataframe copying only rows beyond the first row
df_dep_t_new = df_dep_t[1:]

In [None]:
df_dep_t_new

In [None]:
# Set the header row as the df header 
df_dep_t_new.columns = new_header

In [None]:
df_dep_t_new

In [None]:
# Turn the departments dataframe (df_dep_t_new) into a data dictionary
data_dict = df_dep_t_new.to_dict('index')

In [None]:
data_dict
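
The transpose, header, and dictionary steps above can be sketched end to end on a tiny made-up table. The layout of `toy_dep` is an assumption about how departments.csv looks after import, and the department names are placeholders:

```python
import pandas as pd

# Toy wide-format table: one row, one column per department id (assumed layout)
toy_dep = pd.DataFrame({
    'department_id': ['department'],
    '1': ['dept_a'],
    '2': ['dept_b'],
    '3': ['dept_c'],
})

# Transpose so that each department id becomes a row
toy_t = toy_dep.T

# The first row holds the header; the remaining rows hold the data
header = toy_t.iloc[0]
toy_t_new = toy_t[1:].copy()
toy_t_new.columns = header

# Keys of the dictionary are the index labels, which are strings here
toy_dict = toy_t_new.to_dict('index')
print(toy_dict.get('2'))
```

Because the keys come from the original column names, they are strings. A lookup with an integer, such as `toy_dict.get(2)`, returns `None`.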

In [None]:
# Import the “products.csv” file into Jupyter as df_prods 
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [None]:
df_prods.head()

In [None]:
print(data_dict.get('19'))

In [None]:
# Create a subset of df_prods containing only department 19
df_snacks = df_prods[df_prods['department_id'] == 19]

In [None]:
df_prods['department_id']==19

In [None]:
df_snacks.head()

In [None]:
# Another way of creating subset of dataframe (df)
df_snacks_2 = df_prods.loc[df_prods['department_id'] == 19]

In [None]:
df_snacks_2

In [None]:
# And another way of creating subset of dataframe
df_snacks_3 = df_prods.loc[df_prods['department_id'].isin([19])]

In [None]:
df_snacks_3
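
The three subsetting approaches above should return the same rows. A quick sketch to confirm this on a made-up dataframe (`toy_prods` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for df_prods
toy_prods = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'department_id': [19, 7, 19, 4],
})

# Boolean mask that is True for rows in department 19
mask = toy_prods['department_id'] == 19

sub_a = toy_prods[mask]                                        # bracket indexing
sub_b = toy_prods.loc[mask]                                    # .loc with a mask
sub_c = toy_prods.loc[toy_prods['department_id'].isin([19])]   # .loc with isin

# All three subsets contain identical rows
print(sub_a.equals(sub_b) and sub_b.equals(sub_c))
```

`isin` is the most convenient of the three when more than one value needs to be matched.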

# 02. Procedures for Task (Steps 2 - 13)

### Step 2 - Find another identifier variable in the df_ords dataframe that doesn’t need to be included in your analysis as a numeric variable and change it to a suitable format.

In [None]:
# Step 2 - Converting data type of user_id to string
df_ords['user_id'] = df_ords['user_id'].astype('str')

In [None]:
df_ords['user_id'].dtype

### Step 3 - Look for a variable in your df_ords dataframe with an unintuitive name and change its name without overwriting the data frame.

In [None]:
# Step 3 - Renaming the variable order_hour_of_day to order_time_of_day (inplace = True modifies df_ords directly, without reassigning it)
df_ords.rename(columns = {'order_hour_of_day' : 'order_time_of_day'}, inplace = True)

In [None]:
# Check first 5 rows of df_ords
df_ords.head()
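
For comparison, `rename` without `inplace = True` leaves the original dataframe untouched and returns a renamed copy. A minimal sketch on a made-up dataframe (`toy_ords` and `renamed` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for df_ords
toy_ords = pd.DataFrame({'order_hour_of_day': [8, 10]})

# Without inplace = True, rename returns a new dataframe
renamed = toy_ords.rename(columns = {'order_hour_of_day': 'order_time_of_day'})

print(list(toy_ords.columns))   # original column name is still there
print(list(renamed.columns))    # the copy carries the new name
```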

### Step 4 - Your client wants to know what the busiest hour is for placing orders. Find the frequency of the corresponding variable and share your findings.

In [None]:
# Step 4 - Busiest hour for placing orders
df_ords['order_time_of_day'].describe()

In [None]:
# Converting the data type for variable order_time_of_day from a float to integer
df_ords['order_time_of_day'] = df_ords['order_time_of_day'].astype('int')

In [None]:
df_ords['order_time_of_day'].dtype

In [None]:
# Check frequency
df_ords['order_time_of_day'].value_counts(dropna = False)

### Answer: The busiest hour of the day is 10 (10 a.m.), with 288,418 orders placed during that hour.
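
The busiest hour can also be read off programmatically rather than from the printed table. A small sketch on made-up values (`toy_hours` is an assumption and does not reflect the real order counts):

```python
import pandas as pd

# Toy stand-in for df_ords['order_time_of_day']
toy_hours = pd.Series([10, 10, 11, 9, 10, 11], name = 'order_time_of_day')

# Frequency of each hour, sorted from most to least frequent
freq = toy_hours.value_counts(dropna = False)

# idxmax() returns the hour with the highest count; max() returns that count
busiest_hour = freq.idxmax()
busiest_count = freq.max()

print(busiest_hour, busiest_count)
```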

### Step 5 - Determine the meaning behind a value of 4 in the "department_id" column within the df_prods dataframe using a data dictionary.

In [None]:
# Meaning behind 4 in the department_id
print(data_dict.get('4'))

### Step 6 - The sales team in your client’s organization wants to know more about breakfast item sales. Create a subset containing only the required information.

In [None]:
# Subset for breakfast item sales (department 14)
df_breakfast = df_prods[df_prods['department_id'] == 14]

In [None]:
df_breakfast

### Step 7 - They’d also like to see details about customers who might be throwing dinner parties. Your task is to find all observations from the entire dataframe that include items from the following departments: alcohol, deli, beverages, and meat/seafood. You’ll need to present this subset to your client.

In [None]:
# Separate subsets for alcohol, deli, beverages and meat seafood items
df_alcohol = df_prods[df_prods['department_id'] == 5]
df_deli = df_prods[df_prods['department_id'] == 20]
df_beverages = df_prods[df_prods['department_id'] == 7]
df_meat_seafood = df_prods[df_prods['department_id'] == 12]

In [None]:
df_alcohol

In [None]:
df_deli

In [None]:
df_beverages

In [None]:
df_meat_seafood

### Step 8 - It’s important that you keep track of total counts in your dataframes. How many rows does the last dataframe you created have?

In [None]:
# Combined subset for dinner party departments: alcohol (5), deli (20), beverages (7), meat seafood (12)
df_dinner = df_prods.loc[df_prods['department_id'].isin([5,20,7,12])]

In [None]:
df_dinner
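
To answer the row-count question explicitly, the number of rows can be taken from `shape` or `len`. A minimal sketch on a made-up dataframe (`toy_prods` and `toy_dinner` are assumptions, so the count is not the real answer):

```python
import pandas as pd

# Toy stand-in for df_prods
toy_prods = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'department_id': [5, 20, 19, 7, 12],
})

# Same isin filter as used for the dinner subset
toy_dinner = toy_prods.loc[toy_prods['department_id'].isin([5, 20, 7, 12])]

# shape is (rows, columns); the first element is the row count
n_rows = toy_dinner.shape[0]
print(n_rows)
```

Running `df_dinner.shape` on the real data would give the actual row and column counts.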

### Step 9 - Someone from the data engineers team in Instacart thinks they’ve spotted something strange about the customer with a "user_id" of “1.” Extract all the information you can about this user.

In [None]:
# Boolean mask for the customer with user_id of 1 (user_id was converted to a string in Step 2, so compare with '1')
df_ords['user_id'] == '1'

In [None]:
df_ords.head()

In [None]:
# Extracting information about the customer with user_id of 1
df_ords_user_id_1 = df_ords[df_ords['user_id']=='1']

In [None]:
df_ords_user_id_1
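
Because user_id was converted to a string in Step 2, the filter has to compare against the string `'1'`; comparing against the integer `1` would match nothing. A short sketch on a made-up dataframe (`toy_ords` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for df_ords, with user_id converted to a string as in Step 2
toy_ords = pd.DataFrame({'user_id': [1, 1, 2], 'order_number': [1, 2, 1]})
toy_ords['user_id'] = toy_ords['user_id'].astype('str')

# Comparing a string column with an integer finds no matches
int_matches = (toy_ords['user_id'] == 1).sum()

# Comparing with the string '1' finds the expected rows
str_matches = (toy_ords['user_id'] == '1').sum()

print(int_matches, str_matches)
```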

### Step 10 - You also need to provide some details about this user’s behavior. What basic stats can you provide based on the information you have?

In [None]:
# Some basic stats on customer with user_id as 1
df_ords_user_id_1.describe()

### Step 11 - Check the organization and structure of your notebook. Be sure to include section headings and code comments.

### Step 12 - Export your df_ords dataframe as “orders_wrangled.csv” in your “Prepared Data” folder.

In [None]:
# Exporting dataframe as .csv file
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_wrangled.csv'))

### Step 13 - Export the df_dep_t_new dataframe as “departments_wrangled.csv” in your “Prepared Data” folder so that you have a “.csv” file of your departments data in the correct format.

In [None]:
df_dep_t_new

In [None]:
# Exporting dataframe as .csv file
df_dep_t_new.to_csv(os.path.join(path, '02 Data','Prepared Data', 'departments_wrangled.csv'))
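
When exporting, it may help to know what `to_csv` does with the index. By default the index is written as an extra unnamed column, which comes back as `Unnamed: 0` on re-import, whereas `index = False` leaves it out. The sketch below round-trips a made-up dataframe through an in-memory buffer; `toy` and the helper `round_trip` are hypothetical names:

```python
import io
import pandas as pd

# Toy dataframe with a string identifier, similar in spirit to df_ords
toy = pd.DataFrame({'order_id': ['10', '11'], 'order_number': [1, 2]})

def round_trip(df, **kwargs):
    """Write df to an in-memory CSV and read it back."""
    buf = io.StringIO()
    df.to_csv(buf, **kwargs)
    buf.seek(0)
    return pd.read_csv(buf)

# Default export keeps the index as an extra column
cols_default = list(round_trip(toy).columns)

# index = False omits the index from the file
cols_no_index = list(round_trip(toy, index = False).columns)

print(cols_default)
print(cols_no_index)
```

For the departments export the index holds the department ids, so keeping it is useful there. Note also that CSV files do not store data types, so an identifier converted to a string is read back as a number unless the type is set again on import.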