<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice SQL with `pandas`, Pt. 2


---

We've learned about relational databases and the language most use to query them: SQL.  

In this lab, we are going to get more practice converting information to a SQL database, querying that data, and then analyzing it with Python.

In [1]:
# Necessary libraries:
import pandas as pd
import sqlite3
from pandas.io import sql

#### 1.  Read in the `EuroMart` `.csv` data.
- 'EuroMart-ListOfOrders.csv'
- 'EuroMart-OrderBreakdown.csv'
- 'EuroMart-SalesTargets.csv'

In [2]:
# Reading in the `.csv` to a DataFrame:
orders = pd.read_csv('./datasets/csv/EuroMart-ListOfOrders.csv', encoding = 'utf-8')
OBD =  pd.read_csv('./datasets/csv/EuroMart-OrderBreakdown.csv', encoding = 'utf-8')
sales_targets =  pd.read_csv('./datasets/csv/EuroMart-SalesTargets.csv', encoding = 'utf-8')

#### 2. Rename columns to remove any spaces.

In [9]:
orders.columns = [col.replace(' ', '') for col in orders.columns]
OBD.columns = [col.replace(' ', '') for col in OBD.columns]
sales_targets.columns = [col.replace(' ', '') for col in sales_targets.columns]

#### 3. Remove dollar signs from the `sales` and `profit` columns in the `order breakdown` DataFrame.

Convert the columns to float.

In [20]:
OBD['Sales'] = [e.replace('$', '') for e in OBD['Sales']]
OBD['Profit'] = [e.replace('$', '') for e in OBD['Profit']]
OBD.head()

Unnamed: 0,OrderID,ProductName,Discount,Sales,Profit,Quantity,Category,Sub-Category
0,BN-2011-7407039,"Enermax Note Cards, Premium",0.5,45.0,-26.0,3,Office Supplies,Paper
1,AZ-2011-9050313,"Dania Corner Shelving, Traditional",0.0,854.0,290.0,7,Furniture,Bookcases
2,AZ-2011-6674300,"Binney & Smith Sketch Pad, Easy-Erase",0.0,140.0,21.0,3,Office Supplies,Art
3,BN-2011-2819714,"Boston Markers, Easy-Erase",0.5,27.0,-22.0,2,Office Supplies,Art
4,BN-2011-2819714,"Eldon Folders, Single Width",0.5,17.0,-1.0,2,Office Supplies,Storage
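
The step also asks for a float conversion, which the cell above does not appear to perform. Here is a minimal sketch of one way to do it, using a small made-up DataFrame. The names `toy` and `currency_to_float` are hypothetical, and the sample values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values that mimic currency strings.
toy = pd.DataFrame({
    'Sales': ['$45', '$1,012'],
    'Profit': ['-$26', '$290'],
})

def currency_to_float(series):
    """Strip '$' and thousands separators, then cast to float."""
    cleaned = (series.str.replace('$', '', regex=False)
                     .str.replace(',', '', regex=False))
    return cleaned.astype(float)

for col in ['Sales', 'Profit']:
    toy[col] = currency_to_float(toy[col])
```

The same helper could be applied to `OBD['Sales']` and `OBD['Profit']`.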


#### 4. Create a SQL database called `EuroMart` and save the three DataFrames as SQL tables. 

In [24]:
# Establishing a local DB connection:
db_connection = sqlite3.connect('./datasets/sql/2.db.sqlite')


In [25]:
orders.to_sql(name = 'orders', con = db_connection, if_exists = 'replace', index = False)
OBD.to_sql(name = 'OBD', con = db_connection, if_exists = 'replace', index = False)
sales_targets.to_sql(name = 'sales_targets', con = db_connection, if_exists = 'replace', index = False)

#### 5. How many orders has each customer placed? 

In [27]:
def q(query, connection=db_connection):
    return sql.read_sql(query, connection)

In [28]:
q("""SELECT * FROM orders LIMIT 1
""")

Unnamed: 0,OrderID,OrderDate,CustomerName,City,Country,Region,Segment,ShipDate,ShipMode,State
0,BN-2011-7407039,1/1/2011,Ruby Patel,Stockholm,Sweden,North,Home Office,1/5/2011,Economy Plus,Stockholm


In [33]:
q("""SELECT "CustomerName", COUNT("CustomerName") AS order_no
    FROM orders 
    GROUP BY "CustomerName"
    ORDER BY COUNT("CustomerName") DESC
    LIMIT 5
""")

Unnamed: 0,CustomerName,order_no
0,Jose Gambino,13
1,Kayla Tearle,12
2,Mark Washington,12
3,Aaron Bootman,11
4,Georgina Garner,11


> *If you're doubting your output, check using `pandas`.*
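
A cross-check along those lines might look like the sketch below. It uses a tiny made-up table in an in-memory SQLite database, so `toy_orders` and its rows are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data: three orders for Ann, one for Bo.
toy_orders = pd.DataFrame({
    'OrderID': ['A1', 'A2', 'A3', 'B1'],
    'CustomerName': ['Ann', 'Ann', 'Ann', 'Bo'],
})

conn = sqlite3.connect(':memory:')
toy_orders.to_sql('orders', conn, index=False)

# Count orders per customer in SQL.
sql_counts = pd.read_sql(
    """SELECT "CustomerName", COUNT(*) AS order_no
       FROM orders
       GROUP BY "CustomerName"
       ORDER BY order_no DESC""",
    conn,
)
conn.close()

# Count the same thing in pandas and compare the two results.
pandas_counts = toy_orders['CustomerName'].value_counts()
sql_as_dict = dict(zip(sql_counts['CustomerName'], sql_counts['order_no']))
match = sql_as_dict == pandas_counts.to_dict()
```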

#### 6. Create a query to return a table containing only geographic features from the `list of orders` table.

In [41]:
q("""SELECT "City", "Country", "Region", "State"
    FROM orders
    GROUP BY "City", "Country", "Region", "State"
    LIMIT 5
""")

Unnamed: 0,City,Country,Region,State
0,Aachen,Germany,Central,North Rhine-Westphalia
1,Aalen,Germany,Central,Baden-Württemberg
2,Aalst,Belgium,Central,East Flanders
3,Abbeville,France,Central,Nord-Pas-de-Calais-Picardie
4,Aberdeen,United Kingdom,North,Scotland


#### 7. Create a query to return a table containing all orders that had a negative profit from the `order breakdown` table.

In [42]:
q("""SELECT * FROM OBD LIMIT 1
""")

Unnamed: 0,OrderID,ProductName,Discount,Sales,Profit,Quantity,Category,Sub-Category
0,BN-2011-7407039,"Enermax Note Cards, Premium",0.5,45.0,-26.0,3,Office Supplies,Paper


In [44]:
q("""SELECT * 
    FROM OBD 
    WHERE "Profit" < 0
    LIMIT 5
""")

Unnamed: 0,OrderID,ProductName,Discount,Sales,Profit,Quantity,Category,Sub-Category
0,BN-2011-7407039,"Enermax Note Cards, Premium",0.5,45.0,-26.0,3,Office Supplies,Paper
1,BN-2011-2819714,"Boston Markers, Easy-Erase",0.5,27.0,-22.0,2,Office Supplies,Art
2,BN-2011-2819714,"Eldon Folders, Single Width",0.5,17.0,-1.0,2,Office Supplies,Storage
3,BN-2011-3248724,"Ikea Classic Bookcase, Metal",0.6,987.0,-1012.0,6,Furniture,Bookcases
4,BN-2011-3248724,"Binney & Smith Sketch Pad, Blue",0.5,116.0,-56.0,5,Office Supplies,Art


#### 8. Construct a query to return a table containing `customer_name` and `product_name`.  

> **Note:** This will require a JOIN!

In [55]:
q("""SELECT a."CustomerName", a."OrderID", b."ProductName"
    FROM orders AS a
    LEFT JOIN OBD AS b
    ON a."OrderID" = b."OrderID"
    GROUP BY a."CustomerName", a."OrderID", b."ProductName"
    LIMIT 10
""")

Unnamed: 0,CustomerName,OrderID,ProductName
0,Aaron Bootman,AZ-2011-2169445,"Apple Office Telephone, Cordless"
1,Aaron Bootman,AZ-2011-2169445,"Elite Box Cutter, High Speed"
2,Aaron Bootman,AZ-2011-2169445,"Harbour Creations Round Labels, Laser Printer ..."
3,Aaron Bootman,AZ-2011-3937280,"Elite Letter Opener, High Speed"
4,Aaron Bootman,AZ-2011-3937280,"Tenex Light Bulb, Duo Pack"
5,Aaron Bootman,AZ-2011-9409671,"Binney & Smith Markers, Fluorescent"
6,Aaron Bootman,AZ-2011-9409671,"Novimex Shipping Labels, Alphabetical"
7,Aaron Bootman,AZ-2012-808163,"Konica Receipt Printer, Durable"
8,Aaron Bootman,AZ-2012-9056595,"Office Star Bag Chairs, Black"
9,Aaron Bootman,AZ-2012-9056595,"Okidata Calculator, Durable"


#### 9.  How many orders for "office supplies" (category) has Sweden made?

> **Note:** From this point on, you'll probably be combining SQL and `pandas`, in that you would use SQL queries to gather relevant information and then `pandas` to analyze it.

In [60]:
ordersdf = q("""SELECT * FROM orders
""")

In [176]:
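# Note: this counts Swedish orders by customer Segment, not by product Category.
# Category is stored in OBD, so a count by Category requires a join on OrderID.
ordersdf[ordersdf['Country'] == 'Sweden'].groupby('Segment').size()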

Segment
Consumer       45
Corporate      39
Home Office    16
dtype: int64
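
Because the category lives in the order breakdown table, one way to answer the question as posed is to join the two tables and filter on both country and category. This is a sketch on made-up rows; the table contents are hypothetical and only the column names follow the notebook.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE orders (OrderID TEXT, Country TEXT);
    CREATE TABLE OBD (OrderID TEXT, Category TEXT);
    INSERT INTO orders VALUES ('S1', 'Sweden'), ('S2', 'Sweden'), ('F1', 'France');
    INSERT INTO OBD VALUES
        ('S1', 'Office Supplies'), ('S1', 'Office Supplies'), ('S1', 'Furniture'),
        ('S2', 'Furniture'), ('F1', 'Office Supplies');
""")

query = """
    SELECT COUNT(DISTINCT a."OrderID") AS n_orders,
           COUNT(*) AS n_line_items
    FROM orders AS a
    JOIN OBD AS b ON a."OrderID" = b."OrderID"
    WHERE a."Country" = 'Sweden' AND b."Category" = 'Office Supplies'
"""
# n_orders counts distinct orders; n_line_items counts product rows.
n_orders, n_line_items = conn.execute(query).fetchone()
conn.close()
```

Whether "orders" means distinct orders or individual line items is a judgment call, so the sketch reports both.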

#### 10. What were total sales for discontinued products? 

In [82]:
q("""SELECT * FROM orders LIMIT 1
""")

Unnamed: 0,OrderID,OrderDate,CustomerName,City,Country,Region,Segment,ShipDate,ShipMode,State
0,BN-2011-7407039,1/1/2011,Ruby Patel,Stockholm,Sweden,North,Home Office,1/5/2011,Economy Plus,Stockholm


In [93]:
# SQLite has no EXTRACT (that is PostgreSQL syntax), and its date functions do not
# parse M/D/YYYY strings, so the year is taken as the last four characters instead.
q("""SELECT "CustomerName", CAST(substr("OrderDate", -4) AS INTEGER) AS "year"
    FROM orders
""")

#### 11. What is the total quantity of objects sold for each country?

In [67]:
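# Note: this counts orders per country. Quantity is stored in OBD, so total
# quantity per country requires joining orders to OBD and summing Quantity.
ordersdf.groupby('Country').size()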

Country
Austria           135
Belgium            68
Denmark            29
Finland            34
France            991
Germany           806
Ireland            50
Italy             493
Netherlands       194
Norway             37
Portugal           37
Spain             403
Sweden            100
Switzerland        40
United Kingdom    700
dtype: int64
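
To total the quantity of items rather than count orders, the two tables can be joined and `Quantity` summed per country. Here is a minimal sketch on made-up rows (the data is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE orders (OrderID TEXT, Country TEXT);
    CREATE TABLE OBD (OrderID TEXT, Quantity INTEGER);
    INSERT INTO orders VALUES ('S1', 'Sweden'), ('F1', 'France'), ('F2', 'France');
    INSERT INTO OBD VALUES ('S1', 3), ('F1', 2), ('F1', 5), ('F2', 1);
""")

# Sum line-item quantities per country, largest total first.
quantity_by_country = conn.execute("""
    SELECT a."Country", SUM(b."Quantity") AS total_quantity
    FROM orders AS a
    JOIN OBD AS b ON a."OrderID" = b."OrderID"
    GROUP BY a."Country"
    ORDER BY total_quantity DESC
""").fetchall()
conn.close()
```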

#### 12. In what countries were profits lowest? (Report the lowest 5-10).

In [172]:
combine = q("""SELECT a.*, b.* 
            FROM orders AS a
            LEFT JOIN OBD AS b
            ON a."OrderID"=b."OrderID" """) 

In [173]:
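# Strip thousands separators and convert to numeric (also safe if already numeric):
combine['Profit'] = pd.to_numeric(combine['Profit'].astype(str).str.replace(',', '', regex=False))
combine['Sales'] = pd.to_numeric(combine['Sales'].astype(str).str.replace(',', '', regex=False))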

In [177]:
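# numeric_only=True skips text columns, which recent pandas versions refuse to average:
combine.groupby('Country').mean(numeric_only=True).sort_values('Profit')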

Country,Discount,Sales,Profit,Quantity
Portugal,0.5,215.8,-124.342857,4.085714
Netherlands,0.480662,178.913486,-94.625954,3.882952
Sweden,0.507882,150.197044,-86.325123,3.70936
Ireland,0.503,159.98,-68.86,3.92
Denmark,0.508333,129.383333,-60.133333,3.4
Italy,0.12523,258.163432,16.14096,3.689479
France,0.076722,318.206159,36.569415,3.825157
Germany,0.054787,297.97622,52.609146,3.767683
Finland,0.0,323.46875,61.0625,3.140625
Spain,0.034297,327.729304,61.848883,3.785808


#### 13. What countries have the best and worst sales-to-profit ratios?
(Total sales divided by total profits.)

Essentially, this asks how many dollars of sales it takes to generate one dollar of profit.

In [215]:
combine['salestoprofit'] = combine['Sales']/combine['Profit'] 

In [216]:
# Exclude rows with an infinite ratio (Profit of 0) before averaging:
combine[combine['salestoprofit'] != float('inf')].groupby('Country').mean(numeric_only=True)

Country,Discount,Sales,Profit,Quantity,salestoprofit
Austria,0.0,306.612403,82.682171,3.709302,8.597267
Belgium,0.0,314.454545,75.090909,3.916667,8.005054
Denmark,0.508475,130.305085,-61.152542,3.40678,-5.550969
Finland,0.0,326.126984,62.031746,3.142857,10.723186
France,0.077204,320.920498,37.894538,3.842077,4.274455
Germany,0.054622,298.467208,53.890693,3.777014,5.763112
Ireland,0.503125,162.114583,-71.729167,3.864583,-4.077488
Italy,0.126702,258.604188,16.546597,3.693194,4.478296
Netherlands,0.480739,182.379947,-98.121372,3.868074,-3.236133
Norway,0.0,296.373134,77.119403,3.791045,7.4478


#### 14. What shipping method is most commonly used for "bookcases" (sub-category)?

In [229]:
combine[combine['Sub-Category']=='Bookcases'].groupby('ShipMode').size()

ShipMode
Economy         234
Economy Plus     76
Immediate        22
Priority         59
dtype: int64

#### 15. Which city in the `orders` table generated the highest net sales? (List all cities and countries in descending order by net sales).

In [235]:
# numeric_only=True restricts the sum to numeric columns:
combine.groupby('City').sum(numeric_only=True).sort_values('Sales', ascending=False)

City,Discount,Sales,Profit,Quantity,salestoprofit
London,24.30,69230.0,13931.0,820,inf
Berlin,23.50,52555.0,5942.0,690,inf
Vienna,0.00,51844.0,13207.0,714,inf
Madrid,4.10,44981.0,11129.0,580,inf
Paris,6.20,42245.0,6680.0,496,inf
Rome,13.90,28330.0,191.0,409,inf
Barcelona,2.80,27405.0,2246.0,231,inf
Hamburg,1.70,23574.0,5858.0,385,inf
Marseille,3.35,21677.0,2889.0,283,inf
Turin,7.70,19829.0,1937.0,236,inf


####  BONUS: Create a column called `shipping_delay` in the `orders` table that contains the difference in days between `order_date` and `ship_date`.

In [244]:
# Difference between ship date and order date, in whole days:
combine['shipping_delay'] = (pd.to_datetime(combine['ShipDate']) - pd.to_datetime(combine['OrderDate'])).dt.days

In [245]:
combine.head()

Unnamed: 0,OrderID,OrderDate,CustomerName,City,Country,Region,Segment,ShipDate,ShipMode,State,OrderID.1,ProductName,Discount,Sales,Profit,Quantity,Category,Sub-Category,salestoprofit,shipping_delay
0,BN-2011-7407039,1/1/2011,Ruby Patel,Stockholm,Sweden,North,Home Office,1/5/2011,Economy Plus,Stockholm,BN-2011-7407039,"Enermax Note Cards, Premium",0.5,45.0,-26.0,3,Office Supplies,Paper,-1.730769,4
1,AZ-2011-9050313,1/3/2011,Summer Hayward,Southport,United Kingdom,North,Consumer,1/7/2011,Economy,England,AZ-2011-9050313,"Dania Corner Shelving, Traditional",0.0,854.0,290.0,7,Furniture,Bookcases,2.944828,4
2,AZ-2011-6674300,1/4/2011,Devin Huddleston,Valence,France,Central,Consumer,1/8/2011,Economy,Auvergne-Rhône-Alpes,AZ-2011-6674300,"Binney & Smith Sketch Pad, Easy-Erase",0.0,140.0,21.0,3,Office Supplies,Art,6.666667,4
3,BN-2011-2819714,1/4/2011,Mary Parker,Birmingham,United Kingdom,North,Corporate,1/9/2011,Economy,England,BN-2011-2819714,"Boston Markers, Easy-Erase",0.5,27.0,-22.0,2,Office Supplies,Art,-1.227273,5
4,BN-2011-2819714,1/4/2011,Mary Parker,Birmingham,United Kingdom,North,Corporate,1/9/2011,Economy,England,BN-2011-2819714,"Eldon Folders, Single Width",0.5,17.0,-1.0,2,Office Supplies,Storage,-17.0,5


#### BONUS: Update your `orders` table in your SQLite3 DB to include the `shipping_delay` feature.

In [20]:
# A:
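
One possible approach, sketched on a made-up table in an in-memory database: read the table, add the column in `pandas`, and write it back with `if_exists='replace'`. The rows are hypothetical, and the `'%m/%d/%Y'` date format is an assumption about how the dates are written.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')

# Hypothetical stand-in for the orders table.
toy_orders = pd.DataFrame({
    'OrderID': ['A1', 'A2'],
    'OrderDate': ['1/1/2011', '1/4/2011'],
    'ShipDate': ['1/5/2011', '1/9/2011'],
})
toy_orders.to_sql('orders', conn, index=False)

# Load, compute the delay in days, and overwrite the table.
loaded = pd.read_sql('SELECT * FROM orders', conn)
order_dt = pd.to_datetime(loaded['OrderDate'], format='%m/%d/%Y')  # assumed format
ship_dt = pd.to_datetime(loaded['ShipDate'], format='%m/%d/%Y')    # assumed format
loaded['shipping_delay'] = (ship_dt - order_dt).dt.days
loaded.to_sql('orders', conn, if_exists='replace', index=False)

check = pd.read_sql('SELECT "OrderID", "shipping_delay" FROM orders', conn)
conn.close()
```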

#### BONUS: Which product category has the highest average `shipping_delay`?

In [21]:
# A:
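
A sketch of one way to approach this, assuming the `orders` table already carries a `shipping_delay` column: join to the breakdown table and average per category. The rows below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE orders (OrderID TEXT, shipping_delay INTEGER);
    CREATE TABLE OBD (OrderID TEXT, Category TEXT);
    INSERT INTO orders VALUES ('A1', 4), ('A2', 6), ('A3', 2);
    INSERT INTO OBD VALUES
        ('A1', 'Furniture'), ('A2', 'Furniture'),
        ('A3', 'Technology'), ('A1', 'Technology');
""")

# Average delay per category, longest first.
avg_delay = conn.execute("""
    SELECT b."Category", AVG(a."shipping_delay") AS avg_delay
    FROM orders AS a
    JOIN OBD AS b ON a."OrderID" = b."OrderID"
    GROUP BY b."Category"
    ORDER BY avg_delay DESC
""").fetchall()
conn.close()
```

The first row of `avg_delay` is the category with the highest average delay. This averages over line items, so an order with several products in one category is weighted more heavily.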

### Challenge

**In which months and categories were sales targets exceeded?**

---

This may require a considerable amount of data processing.

In [22]:
# A:
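
A rough sketch of the processing involved, on made-up data. The column names `Month`, `Category`, and `Target` in the targets table are assumptions, since the layout of the sales targets file is not shown here, and the date format is assumed as well.

```python
import pandas as pd

# Hypothetical joined sales data (order date, category, sales amount).
sales = pd.DataFrame({
    'OrderDate': ['1/3/2011', '1/20/2011', '2/2/2011'],
    'Category': ['Furniture', 'Furniture', 'Furniture'],
    'Sales': [600.0, 500.0, 300.0],
})
# Hypothetical targets table; the real column names may differ.
targets = pd.DataFrame({
    'Month': ['2011-01', '2011-02'],
    'Category': ['Furniture', 'Furniture'],
    'Target': [1000.0, 1000.0],
})

# Derive a year-month key, total sales per month and category, then compare.
sales['Month'] = pd.to_datetime(sales['OrderDate'], format='%m/%d/%Y').dt.strftime('%Y-%m')
monthly = sales.groupby(['Month', 'Category'], as_index=False)['Sales'].sum()
merged = monthly.merge(targets, on=['Month', 'Category'], how='inner')
exceeded = merged[merged['Sales'] > merged['Target']]
```

With the real data, the month key in the targets table would likely need reshaping so both tables share the same format before merging.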