_Lambda School Data Science_

# Join and Reshape datasets

Objectives
- concatenate data with pandas
- merge data with pandas
- understand tidy data formatting
- melt and pivot data with pandas

Links
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Combine Data Sets: Standard Joins
  - Tidy Data
  - Reshaping Data
- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)
- Python Data Science Handbook
  - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append
  - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join
  - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping
  - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables
  
Reference
- Pandas Documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)
- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)

In [None]:
#!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz 

# Make sure we're in the top-level /content directory
#
# See below for notes on the cd command and why it's %cd instead of !cd
%cd /content

# Remove everything in the current working directory
#
# rm is the remove command
# -rf specifies the "recursive" and "force" options to remove all files in 
# subdirectories without prompting
#
# THIS IS A POWERFUL COMMAND! DO NOT run this command on your local computer - ever!!
#
# In this particular case, removing all of the files makes things easier if you
# need to re-run these examples by allowing you to start with a clean directory
# every time.
!rm -rf *

# wget retrieves files from a remote location
!wget https://www.dropbox.com/s/pofcl26lvoj6073/instacart-market-basket-analysis.zip

In [None]:
# Unzip the archive
#
# Creates a new directory called instacart-market-basket-analysis

!unzip instacart-market-basket-analysis.zip

In [None]:
# Change into the newly-unzipped directory
#
# % sign is required to change to a new directory -- you can't use !cd like
# other commands
#
# Optional technical details:
#
# % makes the command apply to the **entire notebook environment**, which is
# what you need to do to change the working directory
#
# The ! sign **opens a new shell process** behind the scenes to execute the
# command -- this works fine for regular commands like unzip and ls
#
# Therefore, !cd would apply only to that new shell and wouldn't change the
# global notebook environment
#
# If this makes your head hurt, don't worry too much about it. We'll talk
# more about the shell and operating systems stuff later in the program.

%cd instacart-market-basket-analysis
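The difference between `%cd` and `!cd` described above can be reproduced in plain Python. This is a minimal sketch, assuming that `!` behaves like running a command in a child shell via `subprocess` and that `%cd` behaves roughly like `os.chdir` in the notebook's own process. The temporary directory is a hypothetical stand-in for the unzipped dataset directory.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    # Similar to `!cd tmp`: cd runs in a child shell, which exits immediately
    subprocess.run(f'cd "{tmp}"', shell=True, check=True)
    after_shell_cd = os.getcwd()

    # Similar to `%cd tmp`: the current Python process itself changes directory
    os.chdir(tmp)
    after_chdir = os.getcwd()

    # Move back so the temporary directory can be removed cleanly
    os.chdir(start_dir)

# The child shell's cd left this process where it started
print("unchanged after shell cd:", after_shell_cd == start_dir)
# os.chdir actually moved this process into the new directory
print("changed after os.chdir:", os.path.realpath(after_chdir) == os.path.realpath(tmp))
```

Only the second approach affects where later cells look for files, which is why the notebook uses `%cd`.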

In [None]:
# Unzip all .csv.zip files in the directory
!unzip "*.zip"

In [None]:
# List all csv files in the current directory
# -l specifies the "long" listing format, which includes additional info on each file
# -h specifies "human readable" file size units
!ls -l -h *.csv

# Assignment

## Practice joining data

These are the top 10 most frequently ordered products. How many times was each ordered? 

1. Banana
2. Bag of Organic Bananas
3. Organic Strawberries
4. Organic Baby Spinach 
5. Organic Hass Avocado
6. Organic Avocado
7. Large Lemon 
8. Strawberries
9. Limes 
10. Organic Whole Milk

**Here is what you need to do:**

* First, write down which columns you need and which dataframes have them.
* Next, merge these into a single dataframe.
* Then, use pandas functions from the previous lesson to get the **counts of the top 10 most frequently ordered products**.

## Top 10 Most Frequently Ordered Products
We need product names and counts of how many times each product was ordered:
- product_name is in products.csv
- instances of specific orders are in order_products__prior.csv and order_products__train.csv
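Because the order records are split across two files with the same columns, one option is to stack them with `pd.concat` before doing anything else. The sketch below uses small made-up dataframes (`toy_prior` and `toy_train` are hypothetical stand-ins, not the real CSV contents) to show the idea.

```python
import pandas as pd

# Hypothetical stand-ins for order_products__prior.csv and order_products__train.csv
toy_prior = pd.DataFrame({'order_id': [1, 1, 2], 'product_id': [10, 11, 10]})
toy_train = pd.DataFrame({'order_id': [3, 3], 'product_id': [10, 12]})

# Stack the rows of both frames; ignore_index=True builds a fresh 0..n-1 index
toy_order_products = pd.concat([toy_prior, toy_train], ignore_index=True)

print(toy_order_products.shape)
print(toy_order_products)
```

With the real data, the two frames would come from `pd.read_csv` on the files listed above.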

In [None]:
##### YOUR CODE HERE #####


### Try to eyeball duplicate products in a single order

In [None]:
##### YOUR CODE HERE #####


### I'm not seeing any duplicates, but maybe we're just getting unlucky? Can we check for duplicates programmatically?
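One way to check programmatically is `DataFrame.duplicated` with a `subset` of columns. The sketch below runs it on a made-up frame (`toy_orders` is hypothetical) that deliberately contains one repeated pair, so you can see what a positive result looks like.

```python
import pandas as pd

# Hypothetical data: the last row repeats the (order_id, product_id) pair above it
toy_orders = pd.DataFrame({
    'order_id':   [1, 1, 2, 2],
    'product_id': [10, 11, 10, 10],
})

# True for every row whose (order_id, product_id) pair already appeared earlier
dupes = toy_orders.duplicated(subset=['order_id', 'product_id'])

n_dupes = int(dupes.sum())    # number of repeated rows
has_dupes = bool(dupes.any()) # whether any row is repeated

print(n_dupes, has_dupes)
print(toy_orders[dupes])
```

On data with no repeated pairs, `dupes.any()` would be `False` and `dupes.sum()` would be 0.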

In [None]:
##### YOUR CODE HERE #####


In [None]:
### THIS CELL TESTS YOUR CODE ###
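# duplicated() flags repeated (order_id, product_id) pairs. If the most common
# flag value covers every row, all flags are False, so there are no duplicates.
assert order_products.duplicated(subset=['order_id', 'product_id']).value_counts().iloc[0] == order_products.shape[0]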

### Conclusion? This dataset does not record the quantity of each item ordered, only which products were in each order and whether the shopper had ordered each product before. So our counts of how many times the top 10 products were ordered are really the number of orders that included each product.

### In order to count the frequency of orders of a given product we need to combine orders and products so that we have names associated with the products in each order.
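Attaching names to order rows is a standard merge on the shared `product_id` column. Here is a minimal sketch on made-up data (`toy_order_products` and `toy_products` are hypothetical, as are their values). It assumes an inner join is what you want, so products that never appear in an order drop out.

```python
import pandas as pd

# Hypothetical order rows: one row per product in an order
toy_order_products = pd.DataFrame({
    'order_id':   [1, 1, 2, 3],
    'product_id': [10, 11, 10, 10],
})

# Hypothetical product lookup table
toy_products = pd.DataFrame({
    'product_id':   [10, 11, 12],
    'product_name': ['Banana', 'Limes', 'Large Lemon'],
})

# Keep only the columns needed from products, then join on the shared key
toy_merged = toy_order_products.merge(
    toy_products[['product_id', 'product_name']],
    on='product_id',
    how='inner',
)

print(toy_merged)
```

Selecting just `product_id` and `product_name` before merging keeps the result narrow, which matters more on the full dataset than on this toy example.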

In [None]:
products.head()

In [None]:
# We can pass a list of column headers in order to select multiple columns
# the order of the columns will follow the order of the column headers 
# in the list
products[['department_id', 'product_name', 'product_id', 'aisle_id']]

In [None]:
##### YOUR CODE HERE #####


### Getting the final counts just means counting how many times each unique product_name appears
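As a rough illustration, `Series.value_counts` does this counting and sorts the result from most to least frequent. The frame below is made up (`toy_merged_names` is hypothetical), and `head(2)` stands in for the `head(10)` you would use on the real data.

```python
import pandas as pd

# Hypothetical merged data: one row per product in an order
toy_merged_names = pd.DataFrame({
    'order_id':     [1, 1, 2, 3, 3, 4],
    'product_name': ['Banana', 'Limes', 'Banana', 'Banana', 'Limes', 'Large Lemon'],
})

# Count rows per product name, sorted in descending order of count
counts = toy_merged_names['product_name'].value_counts()

# Keep only the most frequent names (head(10) on the real data)
top = counts.head(2)

print(top)
```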

In [None]:
##### YOUR CODE HERE #####


# Showcase Project Milestone

Watch the Showcase Project (formerly known as Build Week) kickoff video to get a sense of what you will accomplish over the next few weeks:
https://youtu.be/WYi9EXH-9lU

# Personal Development

Spend a little while today researching potential employers.  Pick a company you could be interested in working for, and read through the job skills required for a few different roles.  Even though you are on the Data Science track at Lambda School, the position you eventually get might not be called "data scientist".  Keep going (move on to more companies if you need to) until you've found 5 different roles that require data science skills.  