<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice SQL with `pandas`, Pt. 2

_Authors: Sam Stack (DC)_

---

We've learned about relational databases and SQL, the language most commonly used to query them.

In this lab, we are going to get more practice loading data into a SQL database, querying that data, and then analyzing it with Python.

In [1]:
# Necessary libraries:
import pandas as pd
import sqlite3
from pandas.io import sql

#### 1.  Read in the `EuroMart` `.csv` data.
- 'EuroMart-ListOfOrders.csv'
- 'EuroMart-OrderBreakdown.csv'
- 'EuroMart-SalesTargets.csv'

In [37]:
# Reading in the `.csv` to a DataFrame:
orders = pd.read_csv('../datasets/csv/EuroMart-ListOfOrders.csv', encoding = 'utf-8')
OBD =  pd.read_csv('../datasets/csv/EuroMart-OrderBreakdown.csv', encoding = 'utf-8')
sales_targets =  pd.read_csv('../datasets/csv/EuroMart-SalesTargets.csv', encoding = 'utf-8')

#### 2. Rename columns to remove any spaces.

In [38]:
# Renaming columns to remove spaces:
orders.columns = ['order_id','order_date','customer_name','city','country','region',
                        'segment','ship_date','ship_mode','state']
OBD.columns = ['order_id','product_name','discount','sales','profit','quantity',
          'category','sub-category']
 
sales_targets.columns = ['month_of_order_date','category','target']

#### 3. Remove dollar signs from the `sales` and `profit` columns in the `order breakdown` DataFrame.

Convert the columns to float.

In [39]:
# Removing dollar signs from the `sales` and `profit` columns:
OBD['sales'] = OBD['sales'].map(lambda x: x.strip('$'))
OBD['sales'] = OBD['sales'].map(lambda x: float(x.replace(',','')))

OBD['profit'] = OBD['profit'].map(lambda x: x.replace('$',''))
OBD['profit'] = OBD['profit'].map(lambda x: float(x.replace(',','')))
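
The same cleanup can also be written with vectorized string methods instead of `map`. This is a minimal sketch on a toy DataFrame; the values, and the assumption that negative amounts look like `-$26`, are hypothetical and not taken from the EuroMart files.

```python
import pandas as pd

# Toy data (hypothetical values) shaped like the raw `sales` and `profit` strings.
toy = pd.DataFrame({'sales': ['$45', '$1,987'],
                    'profit': ['-$26', '$1,012']})

for col in ['sales', 'profit']:
    # Remove every '$' and ',' wherever it appears, then cast to float.
    toy[col] = toy[col].str.replace(r'[$,]', '', regex=True).astype(float)

print(toy.dtypes)
```

Using `replace` rather than `strip` matters when the dollar sign is not the first character, as in `-$26`.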


#### 4. Create a SQL database called `EuroMart` and save the three DataFrames as SQL tables. 

In [40]:
# Establishing a local DB connection:
db_connection = sqlite3.connect('../datasets/sql/EuroMart.db.sqlite')

# Writing the DataFrames out as SQL tables:
orders.to_sql(name = 'orders', con = db_connection, if_exists = 'replace', index = False)
OBD.to_sql(name = 'order_breakdown', con = db_connection, if_exists = 'replace', index = False)
sales_targets.to_sql(name = 'sales_targets', con = db_connection, if_exists = 'replace', index = False)
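
To confirm that the tables were actually written, one option is to query SQLite's built-in catalog, `sqlite_master`. The sketch below uses an in-memory database and a one-row toy table (both assumptions) so that it runs on its own.

```python
import sqlite3
import pandas as pd

# In-memory database and a toy table (hypothetical) so the sketch is self-contained.
conn = sqlite3.connect(':memory:')
pd.DataFrame({'order_id': ['A-1']}).to_sql('orders', con=conn, index=False)

# `sqlite_master` lists every table, index, and view stored in the database.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", con=conn)
print(tables)

conn.close()
```

Against the lab database, the same query should list `orders`, `order_breakdown`, and `sales_targets`.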


In [6]:
# Getting the column Labels:  
orders.head(1)

|   | order_id | order_date | customer_name | city | country | region | segment | ship_date | ship_mode | state |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BN-2011-7407039 | 1/1/2011 | Ruby Patel | Stockholm | Sweden | North | Home Office | 1/5/2011 | Economy Plus | Stockholm |


In [7]:
OBD.head(1)

|   | order_id | product_name | discount | sales | profit | quantity | category | sub-category |
|---|---|---|---|---|---|---|---|---|
| 0 | BN-2011-7407039 | Enermax Note Cards, Premium | 0.5 | 45.0 | -26.0 | 3 | Office Supplies | Paper |


In [8]:
sales_targets.head(1)

|   | month_of_order_date | category | target |
|---|---|---|---|
| 0 | Jan-11 | Furniture | $10,000.00 |


#### 5. How many orders has each customer placed? 

In [9]:
# Getting all customer names and loading them into a `pandas` DataFrame:
customers = sql.read_sql('SELECT customer_name FROM orders', con = db_connection)

# Counting how many times each name appears (one row per order):
customers['customer_name'].value_counts().head()

Jose Gambino       13
Mark Washington    12
Kayla Tearle       12
Terence Welch      11
Maya Pamphlett     11
Name: customer_name, dtype: int64

> *If you're doubting your output, check using `pandas`.*
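
The count can also be done entirely in SQL with `GROUP BY`, which is a handy cross-check against `value_counts()`. A minimal sketch, using an in-memory database and toy rows (assumptions, not the real data):

```python
import sqlite3
import pandas as pd

# Toy `orders` table (hypothetical rows) in an in-memory database.
conn = sqlite3.connect(':memory:')
toy_orders = pd.DataFrame({
    'order_id': ['A-1', 'A-2', 'A-3'],
    'customer_name': ['Ruby Patel', 'Mary Parker', 'Ruby Patel'],
})
toy_orders.to_sql('orders', con=conn, index=False)

# One output row per customer, with the number of order rows for each.
query = '''
SELECT customer_name, COUNT(*) AS n_orders
FROM orders
GROUP BY customer_name
ORDER BY n_orders DESC, customer_name
'''
order_counts_sql = pd.read_sql(query, con=conn)
print(order_counts_sql)

conn.close()
```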

#### 6. Create a query to return a table containing only geographic features from the `list of orders` table.

In [10]:
# City, country, region, and state are all geographic.
sql.read_sql('SELECT city, country, region, state FROM orders', con = db_connection).head()

|   | city | country | region | state |
|---|---|---|---|---|
| 0 | Stockholm | Sweden | North | Stockholm |
| 1 | Southport | United Kingdom | North | England |
| 2 | Valence | France | Central | Auvergne-Rhône-Alpes |
| 3 | Birmingham | United Kingdom | North | England |
| 4 | Echirolles | France | Central | Auvergne-Rhône-Alpes |


#### 7. Create a query to return a table containing all orders that had a negative profit from the `order breakdown` table.

In [11]:
# Identifying any value in the `profit` column that contains a '-' sign:
sql.read_sql('SELECT * FROM order_breakdown WHERE profit LIKE "%-%"', con = db_connection).head()
# `profit` was already converted to float in step 3. SQLite compares the text form
# of a number when using LIKE, so the pattern works on numeric and string columns alike.

|   | order_id | product_name | discount | sales | profit | quantity | category | sub-category |
|---|---|---|---|---|---|---|---|---|
| 0 | BN-2011-7407039 | Enermax Note Cards, Premium | 0.5 | 45.0 | -26.0 | 3 | Office Supplies | Paper |
| 1 | BN-2011-2819714 | Boston Markers, Easy-Erase | 0.5 | 27.0 | -22.0 | 2 | Office Supplies | Art |
| 2 | BN-2011-2819714 | Eldon Folders, Single Width | 0.5 | 17.0 | -1.0 | 2 | Office Supplies | Storage |
| 3 | BN-2011-3248724 | Ikea Classic Bookcase, Metal | 0.6 | 987.0 | -1012.0 | 6 | Furniture | Bookcases |
| 4 | BN-2011-3248724 | Binney & Smith Sketch Pad, Blue | 0.5 | 116.0 | -56.0 | 5 | Office Supplies | Art |
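
Because `profit` is stored as a number, a plain numeric comparison is a more direct way to find losses than pattern matching. A minimal sketch on a toy table (hypothetical rows, in-memory database):

```python
import sqlite3
import pandas as pd

# Toy `order_breakdown` table (hypothetical rows) in an in-memory database.
conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'product_name': ['Note Cards', 'Bookcase', 'Markers'],
    'profit': [-26.0, 120.0, 0.0],
}).to_sql('order_breakdown', con=conn, index=False)

# Numeric filter: keeps only rows whose profit is strictly below zero.
losses = pd.read_sql('SELECT * FROM order_breakdown WHERE profit < 0', con=conn)
print(losses)

conn.close()
```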


#### 8. Construct a query to return a table containing `customer_name` and `product_name`.  

> **Note:** This will require a JOIN!

In [12]:
sql.read_sql('SELECT orders."order_id", orders."customer_name", order_breakdown."product_name" '
'FROM orders '
'LEFT JOIN order_breakdown '
'ON orders."order_id"= order_breakdown."order_id"',
            con = db_connection).head()


|   | order_id | customer_name | product_name |
|---|---|---|---|
| 0 | BN-2011-7407039 | Ruby Patel | Enermax Note Cards, Premium |
| 1 | AZ-2011-9050313 | Summer Hayward | Dania Corner Shelving, Traditional |
| 2 | AZ-2011-6674300 | Devin Huddleston | Binney & Smith Sketch Pad, Easy-Erase |
| 3 | BN-2011-2819714 | Mary Parker | Boston Markers, Easy-Erase |
| 4 | BN-2011-2819714 | Mary Parker | Eldon Folders, Single Width |


#### 9.  How many orders for "office supplies" (category) has Sweden made?

> **Note:** From this point on, you'll probably be combining SQL and `pandas`, in that you would use SQL queries to gather relevant information and then `pandas` to analyze it.

In [13]:
swedish_supplies = sql.read_sql('SELECT orders."order_id", orders."country", order_breakdown."category" '
'FROM orders '
'LEFT JOIN order_breakdown '
'ON orders."order_id"= order_breakdown."order_id" '
'WHERE orders."country" = "Sweden" AND order_breakdown."category"="Office Supplies"',
            con = db_connection)

swedish_supplies.count()

order_id    133
country     133
category    133
dtype: int64
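
Note that counting joined rows counts order lines, so an order with two office-supply products is counted twice. The sketch below, on toy tables (hypothetical rows), shows how `COUNT(*)` and `COUNT(DISTINCT ...)` can differ, and passes the filter values as query parameters rather than embedding them in the SQL string.

```python
import sqlite3
import pandas as pd

# Toy tables (hypothetical rows) in an in-memory database.
conn = sqlite3.connect(':memory:')
pd.DataFrame({'order_id': ['A-1', 'A-2'],
              'country': ['Sweden', 'France']}).to_sql('orders', con=conn, index=False)
pd.DataFrame({'order_id': ['A-1', 'A-1', 'A-1', 'A-2'],
              'category': ['Office Supplies', 'Office Supplies',
                           'Furniture', 'Office Supplies']}
             ).to_sql('order_breakdown', con=conn, index=False)

# n_lines counts matching order lines; n_orders counts distinct orders.
query = '''
SELECT COUNT(*) AS n_lines,
       COUNT(DISTINCT orders.order_id) AS n_orders
FROM orders
INNER JOIN order_breakdown
    ON orders.order_id = order_breakdown.order_id
WHERE orders.country = ? AND order_breakdown.category = ?
'''
result = pd.read_sql(query, con=conn, params=('Sweden', 'Office Supplies'))
print(result)

conn.close()
```

Which of the two numbers answers the question depends on whether "orders" means order lines or distinct order IDs.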

#### 10. What were total sales for discounted products? 

In [14]:
discount_sales = sql.read_sql('SELECT discount, sales FROM order_breakdown WHERE discount > 0',
                              con = db_connection)

discount_sales['sales'].sum()


1115614.0

#### 11. What is the total quantity of objects sold for each country?

In [15]:
order_counts = sql.read_sql('SELECT order_breakdown."quantity", orders."country" '
                            'FROM orders '
                            'INNER JOIN order_breakdown '
                            'ON orders."order_id"= order_breakdown."order_id" ',
            con = db_connection)

order_counts.groupby('country').sum()


| country | quantity |
|---|---|
| Austria | 973 |
| Belgium | 532 |
| Denmark | 204 |
| Finland | 201 |
| France | 7329 |
| Germany | 6179 |
| Ireland | 392 |
| Italy | 3612 |
| Netherlands | 1526 |
| Norway | 261 |
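
The aggregation can also be pushed into the query itself with `SUM` and `GROUP BY`, so that only one row per country comes back from the database. A minimal sketch on toy tables (hypothetical rows, in-memory database):

```python
import sqlite3
import pandas as pd

# Toy tables (hypothetical rows) in an in-memory database.
conn = sqlite3.connect(':memory:')
pd.DataFrame({'order_id': ['A-1', 'A-2', 'A-3'],
              'country': ['Sweden', 'France', 'Sweden']}
             ).to_sql('orders', con=conn, index=False)
pd.DataFrame({'order_id': ['A-1', 'A-1', 'A-2', 'A-3'],
              'quantity': [3, 2, 6, 5]}
             ).to_sql('order_breakdown', con=conn, index=False)

# Join, then total the quantity within each country.
query = '''
SELECT orders.country, SUM(order_breakdown.quantity) AS total_quantity
FROM orders
INNER JOIN order_breakdown
    ON orders.order_id = order_breakdown.order_id
GROUP BY orders.country
ORDER BY total_quantity DESC
'''
quantity_by_country = pd.read_sql(query, con=conn)
print(quantity_by_country)

conn.close()
```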


#### 12. In what countries were profits lowest? (Report the lowest 5-10).

In [36]:
# Gather `country` and `profit`. 
profits = sql.read_sql('SELECT order_breakdown."profit", orders."country" '
                            'FROM orders '
                            'INNER JOIN order_breakdown '
                            'ON orders."order_id"= order_breakdown."order_id" ',
            con = db_connection)

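# GROUP BY country, sum, and sort ascending on `profit`.
# Note: the slice `[5:11]` keeps positions 5 through 10 and skips the five lowest countries;
# use `.head(10)` instead to include them.
profits.groupby('country').sum().sort_values('profit').reset_index()[5:11]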

|   | country | profit |
|---|---|---|
| 5 | Finland | 3908.0 |
| 6 | Norway | 5167.0 |
| 7 | Switzerland | 7234.0 |
| 8 | Belgium | 9912.0 |
| 9 | Italy | 15802.0 |
| 10 | Austria | 21332.0 |


#### 13. What countries have the best and worst profit-to-sales ratios?
(Total profits divided by total sales).

Essentially, this is asking us to determine the profit made for every dollar of product sold.

In [19]:
# Total profits/Total sales
# Grabbing profits, sales, and countries:
spr = sql.read_sql('SELECT order_breakdown."profit",order_breakdown."sales", orders."country" '
                            'FROM orders '
                            'INNER JOIN order_breakdown '
                            'ON orders."order_id"= order_breakdown."order_id" ',
            con = db_connection)

# Summing profits and sales by country:
spr2 = spr.groupby('country').sum().sort_values('profit')

# Creating the ratio column:
spr2['ratio'] = spr2['profit']/spr2['sales']

# Sorting by ratio column:
spr2.sort_values('ratio', ascending = False)

| country | profit | sales | ratio |
|---|---|---|---|
| Switzerland | 7234.0 | 24874.0 | 0.290826 |
| Austria | 21332.0 | 79382.0 | 0.268726 |
| Norway | 5167.0 | 20529.0 | 0.251693 |
| Belgium | 9912.0 | 42320.0 | 0.234216 |
| United Kingdom | 90382.0 | 420497.0 | 0.214941 |
| Finland | 3908.0 | 20702.0 | 0.188774 |
| Spain | 47067.0 | 249402.0 | 0.188719 |
| Germany | 86279.0 | 488681.0 | 0.176555 |
| France | 70067.0 | 609683.0 | 0.114924 |
| Italy | 15802.0 | 252742.0 | 0.062522 |
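
The ratio can also be computed inside the query. The sketch below uses toy tables (hypothetical rows) and multiplies by `1.0` so that SQLite performs floating-point division even if the columns were stored as integers.

```python
import sqlite3
import pandas as pd

# Toy tables (hypothetical rows) in an in-memory database.
conn = sqlite3.connect(':memory:')
pd.DataFrame({'order_id': ['A-1', 'A-2'],
              'country': ['Sweden', 'France']}
             ).to_sql('orders', con=conn, index=False)
pd.DataFrame({'order_id': ['A-1', 'A-1', 'A-2'],
              'profit': [10.0, -5.0, 30.0],
              'sales': [100.0, 50.0, 60.0]}
             ).to_sql('order_breakdown', con=conn, index=False)

# Total profit divided by total sales, per country.
query = '''
SELECT orders.country,
       SUM(order_breakdown.profit) AS profit,
       SUM(order_breakdown.sales) AS sales,
       1.0 * SUM(order_breakdown.profit) / SUM(order_breakdown.sales) AS ratio
FROM orders
INNER JOIN order_breakdown
    ON orders.order_id = order_breakdown.order_id
GROUP BY orders.country
ORDER BY ratio DESC
'''
ratios = pd.read_sql(query, con=conn)
print(ratios)

conn.close()
```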


#### 14. What shipping method is most commonly used for "bookcases" (sub-category)?

In [20]:
sql.read_sql('SELECT orders."ship_mode",order_breakdown."sub-category" '
                            'FROM orders '
                            'INNER JOIN order_breakdown '
                            'ON orders."order_id"= order_breakdown."order_id" '
                            'WHERE "sub-category" = "Bookcases"' ,
            con = db_connection)['ship_mode'].value_counts()



Economy         234
Economy Plus     76
Priority         59
Immediate        22
Name: ship_mode, dtype: int64

#### 15. Which city in the `orders` table generated the highest net sales? (List all cities and countries in descending order by net sales).

In [22]:
sql.read_sql('SELECT orders."city",orders."country", order_breakdown."sales" '
                            'FROM orders '
                            'INNER JOIN order_breakdown '
                            'ON orders."order_id"= order_breakdown."order_id" ',
            con = db_connection).groupby(['city','country']).sum().sort_values('sales', ascending = False)

| city | country | sales |
|---|---|---|
| London | United Kingdom | 69230.0 |
| Berlin | Germany | 52555.0 |
| Vienna | Austria | 51844.0 |
| Madrid | Spain | 44981.0 |
| Paris | France | 42245.0 |
| Rome | Italy | 28330.0 |
| Barcelona | Spain | 27405.0 |
| Hamburg | Germany | 23574.0 |
| Marseille | France | 21677.0 |
| Turin | Italy | 19829.0 |


####  BONUS: Create a column called `shipping_delay` in the `orders` table that contains the difference in days between `order_date` and `ship_date`.

In [23]:
# Converting columns to datetime objects from objects:
orders['order_date'] = pd.to_datetime(orders['order_date'])
orders['ship_date'] = pd.to_datetime(orders['ship_date'])

In [24]:
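# Engineering a feature that counts the difference in days
# (the column is named `ship_delay` here, and later queries use that name).
# `.dt.days` returns whole days; recent pandas versions no longer accept
# `astype('timedelta64[h]')` or `astype('timedelta64[D]')`.
orders['ship_delay'] = (orders['ship_date'] - orders['order_date']).dt.days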
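
Date strings such as `1/5/2011` are ambiguous, since they can be read month-first or day-first. Passing an explicit `format` to `pd.to_datetime` removes the guesswork. The sketch below assumes a month-first layout and uses toy rows; check the real files before relying on that assumption.

```python
import pandas as pd

# Toy rows (hypothetical) in the same string layout as `order_date` and `ship_date`.
toy = pd.DataFrame({'order_date': ['1/1/2011', '12/30/2011'],
                    'ship_date': ['1/5/2011', '1/2/2012']})

for col in ['order_date', 'ship_date']:
    # Assumption: month/day/year. Use '%d/%m/%Y' if the data is day-first.
    toy[col] = pd.to_datetime(toy[col], format='%m/%d/%Y')

# Subtracting two datetime columns gives timedeltas; `.dt.days` extracts whole days.
toy['ship_delay'] = (toy['ship_date'] - toy['order_date']).dt.days
print(toy)
```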


#### BONUS: Update your `orders` table in your SQLite3 DB to include the `shipping_delay` feature.

In [25]:
# Updating and replacing the `orders` table:
orders.to_sql(name = 'orders', con = db_connection, if_exists = 'replace', index = False)


#### BONUS: Which product category has the highest average `shipping_delay`?

In [26]:
sql.read_sql('SELECT orders."ship_delay", order_breakdown."category" '
                            'FROM orders '
                            'INNER JOIN order_breakdown '
                            'ON orders."order_id"= order_breakdown."order_id" ',
            con = db_connection).groupby('category').mean()

| category | ship_delay |
|---|---|
| Furniture | 4.0 |
| Office Supplies | 3.975028 |
| Technology | 4.12541 |


### Challenge

**In which months and categories were sales targets exceeded?**

---

This may require a considerable amount of data processing.

In [27]:
# First I'm going to extract the information I need using SQL:
month_sales = sql.read_sql('SELECT orders."order_date", order_breakdown."sales",order_breakdown."category" '
             'FROM orders '
             'INNER JOIN order_breakdown '
             'ON orders."order_id" = order_breakdown."order_id" ', 
             con = db_connection)

# Convert `order_date` to a datetime object.
month_sales["order_date"] = pd.to_datetime(month_sales["order_date"])

# Create a column that aggregates dates in 'mon-yy' format.
month_sales['mnth_yr'] = month_sales['order_date'].apply(lambda x: x.strftime('%b-%y'))
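
The same month label can be produced without `apply` through the `.dt` accessor. A minimal sketch on toy dates (hypothetical); note that `%b` yields English month abbreviations only under an English or default locale.

```python
import pandas as pd

# Toy dates (hypothetical) standing in for `order_date`.
dates = pd.Series(pd.to_datetime(['2011-01-15', '2011-01-28', '2012-06-03']))

# Vectorized formatting: abbreviated month name, a hyphen, two-digit year.
labels = dates.dt.strftime('%b-%y')
print(labels.tolist())
```

Labels built this way line up with the `month_of_order_date` values such as `Jan-11` in the sales targets table.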


In [28]:
# Taking the new date objects and using them to GROUP BY to determine the sum of sales:
month_sales = month_sales.groupby(['mnth_yr','category']).sales.sum().reset_index()


In [29]:
# Pushing this new DataFrame, which was created with monthly aggregates, back to a local SQL DB:
month_sales.to_sql(name = 'sales_by_month', con = db_connection, if_exists = 'replace', index = False)

In [30]:
# Extracting information again, joining the newly created table and the `sales_targets` table:
targets = sql.read_sql('SELECT sales_targets."month_of_order_date", sales_targets."category", sales_targets."target",sales_by_month."sales" '
                      'FROM sales_targets '
                      'INNER JOIN sales_by_month '
                      'ON sales_targets."month_of_order_date" = sales_by_month."mnth_yr" AND '
                      'sales_targets."category" = sales_by_month."category"',
                      con = db_connection)
# This JOIN uses a two-part condition: rows must match on both month and category.
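
A two-column join like this can also be done directly in `pandas`, without writing the aggregate back to the database. The sketch below uses small toy frames; the Furniture figures mirror rows shown in this lab, while the Technology row is hypothetical.

```python
import pandas as pd

# Toy monthly sales (the Technology row is a made-up value).
sales_by_month = pd.DataFrame({
    'mnth_yr': ['Jan-11', 'Jan-11', 'Feb-11'],
    'category': ['Furniture', 'Technology', 'Furniture'],
    'sales': [5477.0, 8000.0, 13541.0],
})

# Toy targets, already converted to floats.
sales_targets = pd.DataFrame({
    'month_of_order_date': ['Jan-11', 'Feb-11'],
    'category': ['Furniture', 'Furniture'],
    'target': [10000.0, 10100.0],
})

# Inner merge on two key pairs: month must match month, category must match category.
merged = sales_targets.merge(
    sales_by_month,
    left_on=['month_of_order_date', 'category'],
    right_on=['mnth_yr', 'category'],
    how='inner',
)
print(merged[['month_of_order_date', 'target', 'sales']])
```

The Technology row has no matching target, so the inner merge drops it.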

In [31]:
# Removing dollar signs and commas, then converting `target` to a float dtype:
targets['target'] = targets['target'].map(lambda x: x.replace('$',''))
targets['target'] = (targets['target'].map(lambda x: x.replace(',',''))).astype(float)


In [32]:
# Creating a Boolean list that states whether or not sales exceeded their targets.
# A vectorized comparison gives every row a value, including ties (counted as not exceeded):
exceeded = (targets['sales'] > targets['target']).tolist()

In [33]:
# Appending the list to the DataFrame as a column:
targets['exceeded'] = exceeded

In [34]:
# Getting those values that exceed targets:
targets[targets['exceeded']]

|   | month_of_order_date | category | target | sales | exceeded |
|---|---|---|---|---|---|
| 1 | Feb-11 | Furniture | 10100.0 | 13541.0 | True |
| 5 | Jun-11 | Furniture | 10600.0 | 14737.0 | True |
| 8 | Sep-11 | Furniture | 11000.0 | 13763.0 | True |
| 10 | Nov-11 | Furniture | 11300.0 | 15194.0 | True |
| 11 | Dec-11 | Furniture | 11400.0 | 23611.0 | True |
| 17 | Jun-12 | Furniture | 12100.0 | 21661.0 | True |
| 19 | Aug-12 | Furniture | 12400.0 | 21300.0 | True |
| 20 | Sep-12 | Furniture | 12500.0 | 20161.0 | True |
| 21 | Oct-12 | Furniture | 12600.0 | 14923.0 | True |
| 22 | Nov-12 | Furniture | 12800.0 | 15100.0 | True |


**In what months and categories did sales fail to exceed their targets?**

In [35]:
# Getting those values that did not exceed their targets:

targets[~targets['exceeded']]

|   | month_of_order_date | category | target | sales | exceeded |
|---|---|---|---|---|---|
| 0 | Jan-11 | Furniture | 10000.0 | 5477.0 | False |
| 2 | Mar-11 | Furniture | 10300.0 | 7210.0 | False |
| 3 | Apr-11 | Furniture | 10400.0 | 4115.0 | False |
| 4 | May-11 | Furniture | 10500.0 | 8653.0 | False |
| 6 | Jul-11 | Furniture | 10800.0 | 2282.0 | False |
| 7 | Aug-11 | Furniture | 10900.0 | 10606.0 | False |
| 9 | Oct-11 | Furniture | 11100.0 | 4084.0 | False |
| 12 | Jan-12 | Furniture | 11500.0 | 5525.0 | False |
| 13 | Feb-12 | Furniture | 11600.0 | 5820.0 | False |
| 14 | Mar-12 | Furniture | 11800.0 | 9496.0 | False |
