<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice SQL with `pandas`, Pt. 2

_Authors: Sam Stack (DC)_

---

We've learned about relational databases and SQL, the language most often used to query them.

In this lab, we'll get more practice loading data into a SQL database, querying that data, and then analyzing it with Python.

In [1]:
# Necessary libraries:
import pandas as pd
import sqlite3
from pandas.io import sql

#### 1.  Read in the `EuroMart` `.csv` data.
- 'EuroMart-ListOfOrders.csv'
- 'EuroMart-OrderBreakdown.csv'
- 'EuroMart-SalesTargets.csv'

In [2]:
# Reading each `.csv` file into a DataFrame:
orders = pd.read_csv('../datasets/csv/EuroMart-ListOfOrders.csv', encoding = 'utf-8')
OBD = pd.read_csv('../datasets/csv/EuroMart-OrderBreakdown.csv', encoding = 'utf-8')
sales_targets = pd.read_csv('../datasets/csv/EuroMart-SalesTargets.csv', encoding = 'utf-8')

#### 2. Rename columns to remove any spaces.

In [3]:
# Renaming columns to remove spaces:
orders.columns = ['order_id','order_date','customer_name','city','country','region',
                        'segment','ship_date','ship_mode','state']
OBD.columns = ['order_id','product_name','discount','sales','profit','quantity',
          'category','sub-category']
 
sales_targets.columns = ['month_of_order_date','category','target']

#### 3. Remove dollar signs from the `sales` and `profit` columns in the `order breakdown` DataFrame.

Convert the columns to float.

In [4]:
# Removing dollar signs from the `sales` and `profit` columns:
OBD['sales'] = OBD['sales'].map(lambda x: x.strip('$'))
OBD['sales'] = OBD['sales'].map(lambda x: float(x.replace(',','')))

OBD['profit'] = OBD['profit'].map(lambda x: x.replace('$',''))
OBD['profit'] = OBD['profit'].map(lambda x: float(x.replace(',','')))
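
The same cleanup can also be written with vectorized string methods instead of `map` and a lambda. The sketch below is one possible approach, not the lab's official solution; the toy values and the helper name `currency_to_float` are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the `sales` and `profit` columns (values are invented).
toy = pd.DataFrame({
    'sales': ['$1,234.50', '$45', '$0.99'],
    'profit': ['-$26', '$1,100.25', '$3'],
})

def currency_to_float(column):
    """Hypothetical helper: strip '$' and ',' from a string column, then cast to float."""
    cleaned = (column.str.replace('$', '', regex=False)
                     .str.replace(',', '', regex=False))
    return cleaned.astype(float)

toy['sales'] = currency_to_float(toy['sales'])
toy['profit'] = currency_to_float(toy['profit'])
print(toy.dtypes)
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.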


In [5]:
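# Re-running the cleaning step raises an error, because `sales` now holds floats rather than strings:
OBD['sales'] = OBD['sales'].apply(lambda x: x.strip('$'))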

AttributeError: 'float' object has no attribute 'strip'

In [None]:
OBD['sales']

#### 4. Create a SQL database called `EuroMart` and save the three DataFrames as SQL tables. 

In [None]:
# Establishing a local DB connection:
db_connection = sqlite3.connect('../datasets/sql/EuroMart.db.sqlite')

# Writing the DataFrames out as SQL tables:
orders.to_sql(name = 'orders', con = db_connection, if_exists = 'replace', index = False)
OBD.to_sql(name = 'order_breakdown', con = db_connection, if_exists = 'replace', index = False)
sales_targets.to_sql(name = 'sales_targets', con = db_connection, if_exists = 'replace', index = False)
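
To check that a table was actually written, it can help to list the tables SQLite knows about and read one back. This is a minimal sketch using an in-memory database and invented rows, so it does not touch the `EuroMart` file; the names `toy_conn` and `toy_orders` are assumptions for illustration.

```python
import sqlite3
import pandas as pd

# In-memory database, so nothing is written to disk.
toy_conn = sqlite3.connect(':memory:')

toy_orders = pd.DataFrame({'order_id': ['A-1', 'A-2'],
                           'country': ['Sweden', 'France']})
toy_orders.to_sql(name='orders', con=toy_conn, if_exists='replace', index=False)

# `sqlite_master` is SQLite's built-in catalog of tables and indexes.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", con=toy_conn)
round_trip = pd.read_sql('SELECT * FROM orders ORDER BY order_id', con=toy_conn)
print(tables)
print(round_trip)
```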


In [None]:
# Getting the column labels:
orders.head(1)

In [None]:
OBD.head(1)

In [None]:
sales_targets.head(1)

#### 5. How many orders has each customer placed? 

In [None]:
# solution 1
query = '''
        SELECT customer_name, count(*) 
        FROM orders 
        GROUP BY customer_name
        ORDER BY count(*) DESC'''
sql.read_sql(query, con = db_connection).head()

In [None]:
# solution 2

# Getting all customer names and loading them into a DataFrame:
customers = sql.read_sql('SELECT customer_name FROM orders', con = db_connection)

# Counting how many times each customer name appears:
customers['customer_name'].value_counts().head()

> *If you're doubting your output, check using `pandas`.*
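
As a small illustration of that cross-check, the sketch below runs the SQL `GROUP BY` and the `pandas` `value_counts()` on the same toy table and compares the results. The customer names and the `n_orders` alias are invented for this example.

```python
import sqlite3
import pandas as pd

toy_conn = sqlite3.connect(':memory:')
toy = pd.DataFrame({'customer_name': ['Ann', 'Bob', 'Ann', 'Cy', 'Ann', 'Bob']})
toy.to_sql(name='orders', con=toy_conn, index=False)

# SQL version: one row per customer, with the number of orders.
sql_counts = pd.read_sql(
    '''SELECT customer_name, COUNT(*) AS n_orders
       FROM orders
       GROUP BY customer_name
       ORDER BY n_orders DESC, customer_name''',
    con=toy_conn,
).set_index('customer_name')['n_orders']

# pandas version: count occurrences of each name.
pandas_counts = toy['customer_name'].value_counts()

# Compare as plain dicts so the ordering of rows doesn't matter.
print(sql_counts.to_dict() == pandas_counts.to_dict())
```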

#### 6. Create a query to return a table containing only geographic features from the `list of orders` table.

In [None]:
# City, country, region, and state are all geographic.
sql.read_sql('SELECT city, country, region, state FROM orders', con = db_connection).head()

#### 7. Create a query to return a table containing all orders that had a negative profit from the `order breakdown` table.

In [None]:
# Identifying any cell in the `profit` column that starts with a '-' sign:
sql.read_sql('''SELECT * 
                FROM order_breakdown 
                WHERE profit LIKE '-%'
                ''', con = db_connection).head()
# This pattern match works whether `profit` is stored as text or as a number,
# because SQLite compares the value as text. For a numeric column, `WHERE profit < 0` also works.
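
To see the two filters side by side, here is a minimal sketch on an invented three-row table where `profit` is stored as a float. It assumes SQLite's behavior of comparing numeric values as text under `LIKE`; the table contents are made up.

```python
import sqlite3
import pandas as pd

toy_conn = sqlite3.connect(':memory:')
toy = pd.DataFrame({'product_name': ['Desk', 'Lamp', 'Pen'],
                    'profit': [-26.0, 12.5, -0.4]})
toy.to_sql(name='order_breakdown', con=toy_conn, index=False)

# Numeric comparison: the usual choice when the column holds numbers.
numeric = pd.read_sql(
    'SELECT * FROM order_breakdown WHERE profit < 0 ORDER BY product_name',
    con=toy_conn)

# Pattern match: SQLite converts the number to text before applying LIKE.
pattern = pd.read_sql(
    "SELECT * FROM order_breakdown WHERE profit LIKE '-%' ORDER BY product_name",
    con=toy_conn)

print(numeric)
print(pattern)
```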

#### 8. Construct a query to return a table containing `customer_name` and `product_name`.  

> **Note:** This will require a JOIN!

In [None]:
query = '''
        SELECT orders."order_id"
              ,orders."customer_name"
              ,order_breakdown."product_name"
        FROM orders
             LEFT JOIN order_breakdown ON orders."order_id"= order_breakdown."order_id"
        '''

sql.read_sql(query,con = db_connection).head()


#### 9.  How many orders for "office supplies" (category) has Sweden made?

> **Note:** From this point on, you'll probably be combining SQL and `pandas`, in that you would use SQL queries to gather relevant information and then `pandas` to analyze it.

In [None]:
query = '''
        SELECT  orders."order_id"
               ,orders."country"
               ,order_breakdown."category"          
        FROM orders
                LEFT JOIN order_breakdown ON orders."order_id"= order_breakdown."order_id"
        WHERE orders."country" = 'Sweden' 
              AND order_breakdown."category" = 'Office Supplies'
        '''


swedish_supplies = sql.read_sql(query, con = db_connection)
len(swedish_supplies)
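
Note that counting rows of the joined table counts line items, so an order with two matching products is counted twice. The sketch below, on invented data, shows how `COUNT(*)` and `COUNT(DISTINCT ...)` can differ; which number answers the question depends on how "order" is defined, so treat this as an illustration rather than the expected answer.

```python
import sqlite3
import pandas as pd

toy_conn = sqlite3.connect(':memory:')
toy_orders = pd.DataFrame({'order_id': ['A-1', 'A-2', 'A-3'],
                           'country': ['Sweden', 'Sweden', 'France']})
toy_breakdown = pd.DataFrame({
    'order_id': ['A-1', 'A-1', 'A-2', 'A-3'],
    'category': ['Office Supplies', 'Office Supplies', 'Furniture', 'Office Supplies'],
})
toy_orders.to_sql(name='orders', con=toy_conn, index=False)
toy_breakdown.to_sql(name='order_breakdown', con=toy_conn, index=False)

query = '''
        SELECT COUNT(*) AS line_items,
               COUNT(DISTINCT orders.order_id) AS distinct_orders
        FROM orders
             INNER JOIN order_breakdown ON orders.order_id = order_breakdown.order_id
        WHERE orders.country = 'Sweden'
              AND order_breakdown.category = 'Office Supplies'
        '''
counts = pd.read_sql(query, con=toy_conn)
print(counts)
```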

#### 10. What were total sales for discounted products? 

In [None]:
discount_sales = sql.read_sql('SELECT discount, sales FROM order_breakdown WHERE discount > 0',
                              con = db_connection)

discount_sales['sales'].sum()


#### 11. What is the total quantity of objects sold for each country?

In [None]:
# solution with sql + pandas

order_counts = sql.read_sql('''SELECT order_breakdown."quantity", orders."country"
                            FROM orders
                                 INNER JOIN order_breakdown 
                                     ON orders."order_id"= order_breakdown."order_id"
                            ''',
            con = db_connection)

order_counts.groupby('country').sum()


In [None]:
# solution with sql

query = '''
        SELECT orders."country", sum(order_breakdown."quantity") AS sum
        FROM orders
             INNER JOIN order_breakdown
                ON orders."order_id"= order_breakdown."order_id"
        GROUP BY orders."country"
        '''

sql.read_sql(query,con = db_connection)

#### 12. In what countries were profits lowest? (Report the lowest 5-10).

In [None]:
# Solution with sql + pandas

# Gather `country` and `profit`. 
profits = sql.read_sql('SELECT order_breakdown."profit", orders."country" '
                        'FROM orders '
                        'INNER JOIN order_breakdown '
                        'ON orders."order_id"= order_breakdown."order_id" ',
            con = db_connection)

# GROUP BY country, sum, sort ascending on `profit`, and keep the 10 lowest.
profits.groupby('country').sum().sort_values('profit').reset_index().head(10)

In [None]:
# Solution with SQL ONLY 
# NOTE: ORDER BY AN AGGREGATE FUNCTION!

query = '''
        SELECT orders."country", SUM(order_breakdown."profit")
        FROM orders
                INNER JOIN order_breakdown
                ON orders."order_id"= order_breakdown."order_id"
        GROUP BY orders."country"
        ORDER BY SUM(order_breakdown."profit") 
        '''
sql.read_sql(query, con = db_connection).head(10)
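
The row limit can also be pushed into the query itself with `LIMIT`, so only the lowest rows are returned. This is a sketch on invented data; the aliases `o`, `b`, and `total_profit` are assumptions for the example.

```python
import sqlite3
import pandas as pd

toy_conn = sqlite3.connect(':memory:')
toy_orders = pd.DataFrame({'order_id': [1, 2, 3, 4],
                           'country': ['Sweden', 'France', 'Italy', 'Sweden']})
toy_breakdown = pd.DataFrame({'order_id': [1, 2, 3, 4, 2],
                              'profit': [-50.0, 20.0, -10.0, 5.0, 10.0]})
toy_orders.to_sql(name='orders', con=toy_conn, index=False)
toy_breakdown.to_sql(name='order_breakdown', con=toy_conn, index=False)

query = '''
        SELECT o.country, SUM(b.profit) AS total_profit
        FROM orders AS o
             INNER JOIN order_breakdown AS b ON o.order_id = b.order_id
        GROUP BY o.country
        ORDER BY total_profit ASC   -- lowest totals first
        LIMIT 2                     -- keep only the two lowest countries
        '''
lowest = pd.read_sql(query, con=toy_conn)
print(lowest)
```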

#### 13. What countries have the best and worst profit-to-sales ratios?
(Total profits divided by total sales).

Essentially, this is asking us to determine the profit made for every dollar of product sold.

In [None]:
# Total profits/Total sales
# Grabbing profits, sales, and countries:
spr = sql.read_sql('SELECT order_breakdown."profit",order_breakdown."sales", orders."country" '
                    'FROM orders '
                    'INNER JOIN order_breakdown '
                    'ON orders."order_id"= order_breakdown."order_id" ',
            con = db_connection)

# Summing profits and sales by country:
spr2 = spr.groupby('country').sum().sort_values('profit')

# Creating the ratio column:
spr2['ratio'] = spr2['profit']/spr2['sales']

# Sorting by ratio column:
spr2.sort_values('ratio', ascending = False)

In [None]:
query = '''
        SELECT orders."country"
              ,SUM(order_breakdown."profit") / SUM(order_breakdown."sales")  AS ps_ratio
        FROM orders
        INNER JOIN order_breakdown
        ON orders."order_id"= order_breakdown."order_id" 
        GROUP BY orders."country"
        ORDER BY SUM(order_breakdown."profit") / SUM(order_breakdown."sales")
        '''
data = sql.read_sql(query, con = db_connection)
print(data.head(1))
print(data.tail(1))
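
Both solutions divide summed profit by summed sales. That is not the same as averaging each line item's own ratio, which weights small and large sales equally. A quick sketch on invented numbers shows how the two can differ:

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Sweden', 'Sweden', 'France', 'France'],
    'sales':   [100.0, 1000.0, 200.0, 200.0],
    'profit':  [10.0, 500.0, 20.0, 60.0],
})

# Ratio of sums: total profit per dollar of total sales (what the lab computes).
totals = toy.groupby('country')[['profit', 'sales']].sum()
ratio_of_sums = totals['profit'] / totals['sales']

# Mean of ratios: average of each row's own profit/sales ratio.
mean_of_ratios = (toy['profit'] / toy['sales']).groupby(toy['country']).mean()

print(ratio_of_sums)
print(mean_of_ratios)
```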

#### 14. What shipping method is most commonly used for "bookcases" (sub-category)?

In [None]:
sql.read_sql('SELECT orders."ship_mode", order_breakdown."sub-category" '
            'FROM orders '
            'INNER JOIN order_breakdown '
            'ON orders."order_id" = order_breakdown."order_id" '
            "WHERE order_breakdown.\"sub-category\" = 'Bookcases'",
            con = db_connection)['ship_mode'].value_counts()



#### 15. Which city in the `orders` table generated the highest net sales? (List all cities and countries in descending order by net sales).

In [None]:
sql.read_sql('SELECT orders."city", orders."country", order_breakdown."sales" '
             'FROM orders '
             'INNER JOIN order_breakdown '
             'ON orders."order_id" = order_breakdown."order_id" ',
             con = db_connection).groupby(['city','country']).sum().sort_values('sales', ascending = False)

####  BONUS: Create a column called `shipping_delay` in the `orders` table that contains the difference in days between `order_date` and `ship_date`.

In [None]:
# Converting columns to datetime objects from objects:
orders['order_date'] = pd.to_datetime(orders['order_date'])
orders['ship_date'] = pd.to_datetime(orders['ship_date'])

In [None]:
# Engineering a feature that counts the difference in days.
# Subtracting two datetime columns gives timedeltas; `.dt.days` extracts whole days.
# Note: this solution names the column `ship_delay`.
orders['ship_delay'] = (orders['ship_date'] - orders['order_date']).dt.days


#### BONUS: Update your `orders` table in your SQLite3 DB to include the `shipping_delay` feature.

In [None]:
# Updating and replacing the `orders` table:
orders.to_sql(name = 'orders', con = db_connection, if_exists = 'replace', index = False)


#### BONUS: Which product category has the highest average `shipping_delay`?

In [None]:
sql.read_sql('SELECT orders."ship_delay", order_breakdown."category" '
            'FROM orders '
            'INNER JOIN order_breakdown '
            'ON orders."order_id"= order_breakdown."order_id" ',
            con = db_connection).groupby('category').mean()
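
The average delay can also be computed entirely in SQL. The sketch below uses SQLite's `julianday()` function to turn date strings into day numbers and subtracts them. It assumes the dates are stored as ISO-style text (`YYYY-MM-DD`), and all rows are invented.

```python
import sqlite3
import pandas as pd

toy_conn = sqlite3.connect(':memory:')
toy_orders = pd.DataFrame({
    'order_id':   ['A-1', 'A-2', 'A-3'],
    'order_date': ['2011-01-01', '2011-01-10', '2011-02-01'],
    'ship_date':  ['2011-01-05', '2011-01-12', '2011-02-02'],
})
toy_breakdown = pd.DataFrame({'order_id': ['A-1', 'A-2', 'A-3'],
                              'category': ['Furniture', 'Technology', 'Furniture']})
toy_orders.to_sql(name='orders', con=toy_conn, index=False)
toy_breakdown.to_sql(name='order_breakdown', con=toy_conn, index=False)

query = '''
        SELECT b.category,
               AVG(julianday(o.ship_date) - julianday(o.order_date)) AS avg_delay_days
        FROM orders AS o
             INNER JOIN order_breakdown AS b ON o.order_id = b.order_id
        GROUP BY b.category
        ORDER BY avg_delay_days DESC
        '''
avg_delay = pd.read_sql(query, con=toy_conn)
print(avg_delay)
```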

### Challenge

**In which months and categories were sales targets exceeded?**

---

This may require a considerable amount of data processing.

In [None]:
# First I'm going to extract the information I need using SQL:
month_sales = sql.read_sql('SELECT orders."order_date", order_breakdown."sales",order_breakdown."category" '
             'FROM orders '
             'INNER JOIN order_breakdown '
             'ON orders."order_id" = order_breakdown."order_id" ', 
             con = db_connection)

# Convert `order_date` to a datetime object.
month_sales["order_date"] = pd.to_datetime(month_sales["order_date"])

# Create a column that aggregates dates in 'mon-yy' format.
month_sales['mnth_yr'] = month_sales['order_date'].apply(lambda x: x.strftime('%b-%y'))


In [None]:
# Taking the new date objects and using them to GROUP BY to determine the sum of sales:
month_sales = month_sales.groupby(['mnth_yr','category']).sales.sum().reset_index()
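
Here is the month-label-then-group pattern on a tiny invented table, using the vectorized `.dt.strftime` accessor instead of `apply`. The `%b` abbreviation depends on the system locale; the sketch assumes English month names such as `Jan`.

```python
import pandas as pd

toy = pd.DataFrame({
    'order_date': pd.to_datetime(['2011-01-03', '2011-01-20', '2011-02-01']),
    'category': ['Furniture', 'Furniture', 'Furniture'],
    'sales': [10.0, 5.0, 7.0],
})

# Label each row with its month and two-digit year, e.g. 'Jan-11'.
toy['mnth_yr'] = toy['order_date'].dt.strftime('%b-%y')

# Sum sales within each (month, category) pair.
monthly = toy.groupby(['mnth_yr', 'category'])['sales'].sum().reset_index()
print(monthly)
```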


In [None]:
# Pushing this new DataFrame, which was created with monthly aggregates, back to a local SQL DB:
month_sales.to_sql(name = 'sales_by_month', con = db_connection, if_exists = 'replace', index = False)

In [None]:
# Extracting information again, joining the newly created table and the `sales_targets` table:
targets = sql.read_sql('SELECT sales_targets."month_of_order_date", sales_targets."category", '
                      'sales_targets."target", sales_by_month."sales" '
                      'FROM sales_targets '
                      'INNER JOIN sales_by_month '
                      'ON sales_targets."month_of_order_date" = sales_by_month."mnth_yr" AND '
                      'sales_targets."category" = sales_by_month."category"',
                      con = db_connection)
# This JOIN matches rows on two columns at once: month and category.
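
The same two-column match can be done in `pandas` with `merge`, which may be handy for checking the SQL result. A minimal sketch with invented rows; `left_on` and `right_on` are needed because the month column has a different name in each table.

```python
import pandas as pd

toy_targets = pd.DataFrame({
    'month_of_order_date': ['Jan-11', 'Jan-11', 'Feb-11'],
    'category': ['Furniture', 'Technology', 'Furniture'],
    'target': [100.0, 200.0, 150.0],
})
toy_sales = pd.DataFrame({
    'mnth_yr': ['Jan-11', 'Feb-11', 'Jan-11'],
    'category': ['Furniture', 'Furniture', 'Technology'],
    'sales': [120.0, 150.0, 180.0],
})

# Rows pair up only when BOTH the month and the category agree.
merged = toy_targets.merge(
    toy_sales,
    left_on=['month_of_order_date', 'category'],
    right_on=['mnth_yr', 'category'],
    how='inner',
)
print(merged[['month_of_order_date', 'target', 'sales']])
```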

In [None]:
# Removing string values and converting `targets` to a float dtype:
targets['target'] = targets['target'].map(lambda x: x.replace('$',''))
targets['target'] = (targets['target'].map(lambda x: x.replace(',',''))).astype(float)


In [None]:
# Creating a Boolean Series that states whether or not sales exceeded their targets.
# The comparison yields one value per row; a tie counts as not exceeded.
exceeded = targets['sales'] > targets['target']

In [None]:
# Appending the list to the DataFrame as a column:
targets['exceeded'] = exceeded
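
One thing to watch for is a month where sales exactly equal the target: the flag column needs a value for every row, or the assignment fails with a length mismatch. A vectorized comparison always produces one value per row, as this sketch on invented numbers shows (the third row is a tie):

```python
import pandas as pd

toy_targets = pd.DataFrame({'target': [100.0, 200.0, 150.0],
                            'sales':  [120.0, 180.0, 150.0]})

# Strictly greater than: a tie is treated as "not exceeded".
toy_targets['exceeded'] = toy_targets['sales'] > toy_targets['target']
print(toy_targets)
```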

In [None]:
# Getting those values that exceed targets:
targets[targets['exceeded'] == True]

**In what months and categories did sales fail to exceed their targets?**

In [None]:
# Getting those values that did not exceed expectations:

targets[targets['exceeded'] == False]