# Welcome to an Introduction to Python VV!
Today we will be going through some basics of Python which will include:
* What it is and its use cases
* Overview of Jupyter Notebooks
* Some basic Python functionality
* Orion basics
* Data transformations
* Machine learning
* Plotting your data (if we have time)


## What is Python?
Pythons are nonvenomous snakes found in Asia, Africa and Australia! However, we won't be discussing those today! Python is a general-purpose programming language that was released in the early 90s. It can be used for both software engineering and analysis/data science! I normally use it to build Machine Learning models, which in turn means creating features and doing some data analysis (although R is also a good way to do these).

## Useful Libraries 
There are several libraries that have been created for Python. To use these you run the code `import x as xxx`. The most useful ones that we use are:
* `pandas` - used for data manipulation and analysis
* `numpy` - for working with arrays
* `sklearn` - (Sci-kit Learn) used for Machine Learning
* `matplotlib` - for plotting
* `seaborn` - another library for plotting

So let's import these. Normally you use `as xxx` to shorten the name. For `pandas` and `numpy`, the short names I have used (`pd` and `np`) are the standard ones.


In [7]:
import pandas as pd
import numpy as np

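If a library is not installed, you can run a shell command in a cell by starting the line with `!`. Note that the package is called `scikit-learn` when you install it, even though you import it as `sklearn`.

In [None]:
! pip install scikit-learn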

## Jupyter Notebooks
One of the ways we can use Python is in a notebook, like the one I am showing you right now! Here we can do EDA (exploratory data analysis), build models and create reports. The main jargon you will hear is:
* Cells - These are the parts of the notebook where you write code that can easily be re-run
* Kernel - This is what each notebook runs on (the process that executes your code)

Some quick shortcuts include:
* `Shift Enter` - Run Cell
* `ESC A` - Add cell above
* `ESC B` - Add cell below
* `ESC DD` - Delete cell

There are lots of shortcuts which you can view in the drop downs above. 


## Python functionality
There are lots of different ways to use Python!

### Dictionaries and Lists
Along with flat dataframes, Python can also handle dictionaries and lists! Dictionaries are a combination of keys and values, whereas lists are just values! These can be used together as well, so you can have a list of dictionaries or a dictionary of lists!

In [8]:
dictionary = {'name': 'vanessa virgo', 'age':26 }
lists = [12,56, 99]

In [9]:
dictionary

{'name': 'vanessa virgo', 'age': 26}

In [10]:
lists

[12, 56, 99]
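As a rough sketch of the "list of dictionaries" and "dictionary of lists" idea mentioned above (the names `people` and `scores`, and the second record, are made up for illustration):

```python
# A list of dictionaries: one dictionary per record
people = [
    {'name': 'vanessa virgo', 'age': 26},
    {'name': 'sam smith', 'age': 31},  # hypothetical second record
]

# A dictionary of lists: one list per key
scores = {'maths': [12, 56, 99], 'physics': [45, 67]}

# Pull one value out of each structure
second_name = people[1]['name']        # second record, then its 'name' key
first_maths_score = scores['maths'][0]  # 'maths' list, then its first element
print(second_name, first_maths_score)
```

A list of dictionaries is handy when each record has the same fields; a dictionary of lists is closer to how a dataframe stores its columns.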

### Slicing and Dicing
Python allows you to slice and dice data however you wish! If we take `dictionary` and `lists` from above, we can select parts of them. To select an element from a list you use its position, e.g. `[0]`; for a dictionary you use its key, e.g. `['name']`. The first element of a list is at position 0!

In [14]:
lists[0]

12

In [15]:
dictionary['name']

'vanessa virgo'
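Beyond picking a single element, you can also take ranges. Here is a small sketch using the same objects (the `'city'` key is a made-up example of a key that does not exist):

```python
lists = [12, 56, 99]
dictionary = {'name': 'vanessa virgo', 'age': 26}

first_two = lists[0:2]   # positions 0 and 1; the end position is excluded
last_item = lists[-1]    # negative positions count from the end
age = dictionary['age']  # dictionaries are selected by key, not position

# .get() returns a default value instead of raising a KeyError for a missing key
city = dictionary.get('city', 'unknown')
print(first_two, last_item, age, city)
```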

### Functions
Most programming languages have the ability to create functions (re-runnable code that can be called in scripts, packages etc). Some important things to remember when writing functions are:
* **White space** is important! The body of the function must be indented
* Always define your functions with `def function_name():`
* Adding a type hint (`x: type`) after each argument name is best practice!


In [18]:
def add_numbers(number1:int, number2:int):
    return number1 + number2

add_numbers(1,3)

4
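To make those three points a bit more concrete, here is a hedged variation of the function above. The default value and the return type hint are my additions, not part of the original example:

```python
def add_numbers(number1: int, number2: int = 0) -> int:
    """Return the sum of two numbers; number2 defaults to 0."""
    # The indented lines are the body of the function
    return number1 + number2


total = add_numbers(1, 3)    # both arguments supplied
only_first = add_numbers(5)  # number2 falls back to its default of 0

# Type hints are documentation only: Python does not enforce them at run time,
# so passing strings still works (and concatenates them)
joined = add_numbers('a', 'b')
print(total, only_first, joined)
```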

### If/Else statements
If statements are a staple in many languages. Like functions, they have a requirement:
* **White space** is needed: the code under `if` and `else` must be indented

In [20]:
if 1 == 1:
    print('OF COURSE IT IS')
else:
    print('Something is wrong here!')

OF COURSE IT IS
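When there are more than two cases you can add `elif` branches. A minimal sketch, where the function name `describe_sale` and the cut-off of 3.49 are just illustrative choices:

```python
def describe_sale(sales_value: float) -> str:
    """Label a sale as 'free', 'low' or 'high'."""
    if sales_value == 0:
        return 'free'
    elif sales_value <= 3.49:  # only checked when the first test is False
        return 'low'
    else:                      # everything that matched no earlier branch
        return 'high'


labels = [describe_sale(value) for value in [0, 2.5, 10]]
print(labels)
```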


## Orion
Orion is a Peak-created package that makes it easier to use Python and its functionality. Most of the functions within Orion follow the structure of Python, so even if you don't wish to use it you can transfer the skills over!

In [21]:
! pip install -e git+https://github.com/PeakBI/orion.git#egg=orion --user

Obtaining orion from git+https://github.com/PeakBI/orion.git#egg=orion
  Updating ./src/orion clone
  Running command git fetch -q --tags
  Running command git reset --hard -q 0db0a66273b69a83b048d7effcd2b7996d362c60
Installing collected packages: orion
  Attempting uninstall: orion
    Found existing installation: orion 0.5.5
    Uninstalling orion-0.5.5:
      Successfully uninstalled orion-0.5.5
  Running setup.py develop for orion
Successfully installed orion


### Reading in data from Redshift
You can use Orion to read data in and out of Redshift much like you would use `pandas`. To do this first we `import` the function.

In [22]:
# Import Packages
from orion.sources import RedshiftSource
from orion.sources.io import read_csv
from orion.sources.config import RedshiftConfig

Now to create the connection. Remember that your environment variables must not be empty. To set one, use:

In [23]:
import os
os.environ["TENANT"] = 'newstarter'

You will need to run this for:
* `TENANT`
* `REDSHIFT_USERNAME`
* `REDSHIFT_PASSWORD`
* `REDSHIFT_HOST`
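A quick way to check that nothing has been forgotten is sketched below. The helper `missing_variables` is hypothetical (it is not part of Orion); it only uses the standard `os` module:

```python
import os

REQUIRED_VARIABLES = ['TENANT', 'REDSHIFT_USERNAME', 'REDSHIFT_PASSWORD', 'REDSHIFT_HOST']


def missing_variables(names):
    """Return the names that are unset or empty in the environment."""
    return [name for name in names if not os.environ.get(name)]


os.environ['TENANT'] = 'newstarter'
missing = missing_variables(REQUIRED_VARIABLES)
print('Still to set:', missing)
```

If the printed list is empty, all four variables have a value.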

In [24]:
conf = RedshiftConfig()
query = "select top 10 * from dunnhumby.products"
df = read_csv(RedshiftSource(**conf, query=query))

The `read_csv()` function is similar to the `pd.read_csv()` function, which you can use to read local files within your notebook space.
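For comparison, here is a small sketch of plain `pd.read_csv()`. To keep it self-contained it reads from an in-memory string rather than a real file; with a local file you would pass the file path instead. The two rows are copied from the products data shown in this notebook:

```python
import io

import pandas as pd

# Stand-in for a local CSV file
csv_text = """product_id,department,brand
27633,GROCERY,Private
30177,GROCERY,National
"""

local_df = pd.read_csv(io.StringIO(csv_text))
print(local_df.shape)  # (rows, columns)
```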

### Writing to S3
You can also use Orion to save data to S3/Redshift. I personally save data to S3, then use a `SQL` command called `COPY` to copy the data from S3 into Redshift. For this example I will show you how to save data to S3.

In [26]:
from orion.sources import S3Source
from orion.sources.io import write_csv

In [27]:
write_csv(df,S3Source(bucket = 'kilimanjaro-prod-datalake', 
                      key = 'newstarter/datascience/pythontraining/demo.csv'), index=False)

In [28]:
read_csv(S3Source(bucket = 'kilimanjaro-prod-datalake', 
                      key = 'newstarter/datascience/pythontraining/demo.csv'))

Unnamed: 0,product_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,27633,69,GROCERY,Private,HISPANIC,MEXICAN BEANS REFRIED,16 OZ
1,28143,69,GROCERY,Private,DRY MIX DESSERTS,GELATIN,.3 OZ
2,29037,69,GROCERY,Private,HOT CEREAL,STANDARD OATMEAL,42 OZ
3,30177,544,GROCERY,National,BAG SNACKS,TORTILLA/NACHO CHIPS,11.5 OZ
4,31534,69,GROCERY,Private,SHORTENING/OIL,CORN OIL,48 OZ
5,32553,69,GROCERY,Private,FLUID MILK PRODUCTS,FLUID MILK WHITE ONLY,
6,33614,436,GROCERY,National,FRZN JCE CONC/DRNKS,FRZN CONC UNDER 50% JUICE,12 OZ
7,34426,1075,GROCERY,National,CRACKERS/MISC BKD FD,SNACK CRACKERS,9.5 OZ
8,35334,1251,GROCERY,National,PASTA SAUCE,MAINSTREAM,26 OZ
9,36375,1251,GROCERY,National,FROZEN PIE/DESSERTS,FROZEN CAKES/ALL TYPES INCLUDI,19.6 OZ


If you choose not to use Orion you can use `boto3`. With this you can:
* View what's in a bucket
* Read any type of file

To use this:

In [29]:
import boto3
s3 = boto3.client('s3')
objects = s3.list_objects(Bucket='kilimanjaro-prod-datalake',Prefix = 'newstarter/datascience/pythontraining/')

objects

{'ResponseMetadata': {'RequestId': '2AFF6976C0FCF96B',
  'HostId': '/O07hfl8HQhr/pE4WDMb6ja90IWox9MiF19ybp6NrYBBFOaQOFso3PwSbBgfC4A3aZRtRHynYBk=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '/O07hfl8HQhr/pE4WDMb6ja90IWox9MiF19ybp6NrYBBFOaQOFso3PwSbBgfC4A3aZRtRHynYBk=',
   'x-amz-request-id': '2AFF6976C0FCF96B',
   'date': 'Fri, 12 Feb 2021 15:18:30 GMT',
   'x-amz-bucket-region': 'eu-west-1',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 1},
 'IsTruncated': False,
 'Marker': '',
 'Contents': [{'Key': 'newstarter/datascience/pythontraining/demo.csv',
   'LastModified': datetime.datetime(2021, 2, 12, 15, 17, 52, tzinfo=tzlocal()),
   'ETag': '"c1a1bf76150178df4952d59ed6f5990e"',
   'Size': 912,
   'StorageClass': 'STANDARD',
   'Owner': {'DisplayName': 'aws-prod',
    'ID': '237deaa8f1dbbba706e273fc913c6ad84913cba9a1190d7d3ac77ac300721e01'}}],
 'Name': 'kilimanjaro-prod-datalake',
 'Prefix': 'newstarter

## Data Transformations
So now that you have some data, it's time for the fun bit! There are lots of different things we can use `pandas` for to manipulate and analyse the data we have! I will run through:

* Checking the data
* Filtering data
* Creating new columns
* Joining data
* Aggregating the data

### Check the data
First let's pull out both tables we will be using

In [30]:
products_query = "select  * from dunnhumby.products"
products = read_csv(RedshiftSource(**conf, query=products_query))

transactions_query = "select * from dunnhumby.transactions"
transactions = read_csv(RedshiftSource(**conf, query=transactions_query))

In [31]:
products.head() # Take a look at the first 5 rows of the data

Unnamed: 0,product_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,27510,69,GROCERY,Private,VEGETABLES - SHELF STABLE,MIXED VEGETABLES,15 OZ
1,28102,69,GROCERY,Private,FD WRAPS/BAGS/TRSH BG,FREEZER BAGS,20 CT
2,28919,236,GROCERY,National,REFRGRATD DOUGH PRODUCTS,REFRIGERATED COOKIES-CHUB,16.5 OZ
3,30003,397,MEAT-PCKGD,National,FROZEN MEAT,FRZN BREADED PREPARED CHICK,9 OZ
4,31412,693,DRUG GM,National,CANDY - CHECKLANE,CANDY BARS (SINGLES)(INCLUDING,1.55 OZ


In [32]:
transactions.head()

Unnamed: 0,household_key,basket_id,day,product_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984850000.0,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
1,1172,26985030000.0,1,981760,1,0.79,396,-0.6,946,1,0.0,0.0
2,1060,26985040000.0,1,942708,1,5.24,315,0.0,1251,1,0.0,0.0
3,744,26985170000.0,1,5978648,0,0.0,31582,0.0,1119,1,0.0,0.0
4,718,26985360000.0,1,830503,1,2.99,324,-1.0,1115,1,-1.0,0.0


In [33]:
print('The number of rows in the transactions table is {}'.format(len(transactions))) # format() inserts the value in place of {}
print('The number of rows in the product table is {}'.format(len(products)))

The number of rows in the transactions table is 2595732
The number of rows in the product table is 92353


In [34]:
# Check nulls in each column
transactions.isnull().sum()

household_key        0
basket_id            0
day                  0
product_id           0
quantity             0
sales_value          0
store_id             0
retail_disc          0
trans_time           0
week_no              0
coupon_disc          0
coupon_match_disc    0
dtype: int64

In [35]:
# For every numerical column
transactions.describe()

Unnamed: 0,household_key,basket_id,day,product_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
count,2595732.0,2595732.0,2595732.0,2595732.0,2595732.0,2595732.0,2595732.0,2595732.0,2595732.0,2595732.0,2595732.0,2595732.0
mean,1271.953,34026200000.0,388.7562,2891435.0,100.4286,3.10412,3142.673,-0.5387054,1561.586,56.2215,-0.016416,-0.002918564
std,726.066,4711649000.0,189.721,3837404.0,1153.436,4.182274,8937.113,1.249191,399.8378,27.10223,0.216841,0.03969004
min,1.0,26984850000.0,1.0,25671.0,0.0,0.0,1.0,-180.0,0.0,1.0,-55.93,-7.7
25%,656.0,30408050000.0,229.0,917459.0,1.0,1.29,330.0,-0.69,1308.0,33.0,0.0,0.0
50%,1272.0,32760810000.0,390.0,1028816.0,1.0,2.0,372.0,-0.01,1613.0,56.0,0.0,0.0
75%,1913.0,40126850000.0,553.0,1133018.0,1.0,3.49,422.0,0.0,1843.0,80.0,0.0,0.0
max,2500.0,42305360000.0,711.0,18316300.0,89638.0,840.0,34280.0,3.99,2359.0,102.0,0.0,0.0


In [36]:
# Get the shape of the dataset. Returns (rows, columns)
products.shape

(92353, 7)

### Filtering data
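For data analysis you may want to remove some of the data that isn't needed, e.g. keeping only the rows where sales are at or below a certain value. Here I use 3.49, the 75% value of `sales_value` from the `describe()` output above.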

In [39]:
transactions = transactions[transactions['sales_value'] <= 3.49]
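Rather than typing the cut-off by hand, you could compute it. The sketch below uses a made-up five-row table (`toy`) so it runs on its own, and also shows how to combine two conditions with `&`:

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({'product_id': [1, 2, 3, 4, 5],
                    'sales_value': [0.5, 1.0, 2.0, 4.0, 10.0]})

# 75th percentile of the column
threshold = toy['sales_value'].quantile(0.75)

# Keep rows at or below the threshold
cheap = toy[toy['sales_value'] <= threshold]

# Two conditions: each one needs its own brackets, joined with & (and) or | (or)
mid_range = toy[(toy['sales_value'] >= 1.0) & (toy['sales_value'] <= threshold)]
print(threshold, len(cheap), len(mid_range))
```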

### Creating new columns
You may want to create new columns, either calculated from other columns or set to a fixed value.

In [41]:
transactions['price'] = transactions['sales_value'] * transactions['quantity']

In [42]:
transactions.head()

Unnamed: 0,household_key,basket_id,day,product_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,price
0,2375,26984850000.0,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0,0.82
1,1172,26985030000.0,1,981760,1,0.79,396,-0.6,946,1,0.0,0.0,0.79
3,744,26985170000.0,1,5978648,0,0.0,31582,0.0,1119,1,0.0,0.0,0.0
4,718,26985360000.0,1,830503,1,2.99,324,-1.0,1115,1,-1.0,0.0,2.99
6,718,26985360000.0,1,1098694,1,2.5,324,-0.99,1115,1,0.0,0.0,2.5


### Joining data
Like SQL and R, we can join data in Python. Joining is likely faster at source (e.g. in a database), but if we had some data in S3 we wouldn't be able to query it from there, so it has to be joined after it is loaded. There are several ways to join data:
* `pd.merge()` - Common way to join tables together; you can state whether it's a 'left', 'right' or 'inner' join
* `join()` - Similar to above, but joins on the index by default
* `pd.concat()` - Concatenates tables, e.g. stacking tables with the same columns. Older code also uses `append()`, which was deprecated in pandas 1.4 and removed in pandas 2.0

In [43]:
df = pd.merge(transactions, products, on = 'product_id', how='left')

In [45]:
transactions.head()

Unnamed: 0,household_key,basket_id,day,product_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,price
0,2375,26984850000.0,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0,0.82
1,1172,26985030000.0,1,981760,1,0.79,396,-0.6,946,1,0.0,0.0,0.79
3,744,26985170000.0,1,5978648,0,0.0,31582,0.0,1119,1,0.0,0.0,0.0
4,718,26985360000.0,1,830503,1,2.99,324,-1.0,1115,1,-1.0,0.0,2.99
6,718,26985360000.0,1,1098694,1,2.5,324,-0.99,1115,1,0.0,0.0,2.5


In [46]:
products.head()

Unnamed: 0,product_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,27510,69,GROCERY,Private,VEGETABLES - SHELF STABLE,MIXED VEGETABLES,15 OZ
1,28102,69,GROCERY,Private,FD WRAPS/BAGS/TRSH BG,FREEZER BAGS,20 CT
2,28919,236,GROCERY,National,REFRGRATD DOUGH PRODUCTS,REFRIGERATED COOKIES-CHUB,16.5 OZ
3,30003,397,MEAT-PCKGD,National,FROZEN MEAT,FRZN BREADED PREPARED CHICK,9 OZ
4,31412,693,DRUG GM,National,CANDY - CHECKLANE,CANDY BARS (SINGLES)(INCLUDING,1.55 OZ
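To see how the join types differ, here is a small sketch on made-up tables (`sales`, `items` and `more_sales` are illustrative names, not the real data):

```python
import pandas as pd

sales = pd.DataFrame({'product_id': [1, 2, 3], 'quantity': [5, 1, 2]})
items = pd.DataFrame({'product_id': [1, 2], 'department': ['GROCERY', 'DRUG GM']})

# Left join: keep every row of sales; product 3 has no match so its department is NaN
left = pd.merge(sales, items, on='product_id', how='left')

# Inner join: keep only the products found in both tables
inner = pd.merge(sales, items, on='product_id', how='inner')

# concat stacks tables with the same columns on top of each other
more_sales = pd.DataFrame({'product_id': [4], 'quantity': [7]})
stacked = pd.concat([sales, more_sales], ignore_index=True)

print(len(left), len(inner), len(stacked))
```

Checking the row count before and after a join is a cheap way to spot duplicated or dropped rows.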


### Aggregating data
Aggregating data is very important, and it's very easy to do in Python! The main syntax for this is:
* `groupby(x)` - This states what you are grouping by. You can group by more than one column!
* `[x].whateveryouwant()` - This follows the `groupby()` and states which column to aggregate and what to do with it

The most common aggregations are:
* `sum` - sums the values of column x in each group
* `count` - counts the occurrences
* `mean` - mean value of the column
* `median` - median value of the column
* `max` - max value
* `min` - min value

(FYI there is different syntax to group by - this is just how I do it!)

In [47]:
# Getting the quantity of every group
grouped_data = pd.DataFrame(
                        df.groupby('commodity_desc')['quantity'].sum()
                                ).reset_index()

In [48]:
grouped_data.head()

Unnamed: 0,commodity_desc,quantity
0,,0
1,(CORP USE ONLY),37
2,ADULT INCONTINENCE,67
3,AIR CARE,4338
4,ANALGESICS,1362
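Grouping by more than one column works the same way: pass a list of column names. A minimal sketch on made-up data (`toy` is an illustrative table):

```python
import pandas as pd

toy = pd.DataFrame({
    'department': ['GROCERY', 'GROCERY', 'GROCERY', 'DRUG GM'],
    'brand': ['Private', 'Private', 'National', 'National'],
    'quantity': [1, 2, 4, 3],
})

# One output row per (department, brand) combination that appears in the data
by_two = toy.groupby(['department', 'brand'])['quantity'].sum().reset_index()
print(by_two)
```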


Instead of using `[x].whateveryouwant()` you can use `agg()`, which enables you to do several aggregations on the grouping you have

In [58]:
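# Several aggregations on one column; the result has two-level column names.
# The rename() mapping is a placeholder: swap in the real old and new column names.
pd.DataFrame(
    df.groupby('commodity_desc').agg({'quantity': ['sum', 'max']})
).reset_index().rename(columns={'xxx': 'xxxx'})

# Alternative syntax (named aggregation) that gives flat column names; x is the column to group by:
# df.groupby(x).agg(total_quantity=('quantity', 'sum'), average_quantity=('quantity', 'mean')).reset_index()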

Unnamed: 0_level_0,commodity_desc,quantity,quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,max
0,,0,0
1,(CORP USE ONLY),37,3
2,ADULT INCONTINENCE,67,2
3,AIR CARE,4338,7
4,ANALGESICS,1362,3
...,...,...,...
298,WAREHOUSE SNACKS,9504,5
299,WATCHES/CALCULATORS/LOBBY,20,4
300,WATER,2912,11
301,WATER - CARBONATED/FLVRD DRINK,16207,7


### Sorting values
Now that you have your grouped data, you may want to see what the top values are! Even with your raw data you might want to see the upper or lower values! With this type of syntax you can pass `inplace=True`, which alters the current dataframe instead of returning a new one that you would have to assign to a variable (use at your own risk!)

In [57]:
grouped_data.sort_values(by = 'quantity', ascending=False, inplace=True)
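If you would rather not change the dataframe in place, you can assign the sorted result to a new name. A small sketch, using three of the grouped values shown earlier as stand-in data:

```python
import pandas as pd

grouped = pd.DataFrame({'commodity_desc': ['WATER', 'AIR CARE', 'ANALGESICS'],
                        'quantity': [2912, 4338, 1362]})

# Without inplace=True, sort_values returns a new dataframe and leaves grouped untouched
top_two = grouped.sort_values(by='quantity', ascending=False).head(2)

print(top_two['commodity_desc'].tolist())
print(grouped['commodity_desc'].tolist())  # original order is unchanged
```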

## Machine Learning!
There are different ways you can use machine learning in Python! 
* `sklearn`
* `tensorflow` 
* Code it yourself!

The most common library to use is `sklearn` and I will show you how to train a model and test a model!

There are three types of ML techniques we can build using `sklearn`:
1. Regression - Predicting a continuous value (Supervised)
2. Classification - Predicting a label (Supervised)
3. Clustering - Grouping similar values together (Unsupervised)

I will go through Regression and Classification today! However, the clustering syntax is very similar to the other two!
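To give a flavour of the clustering syntax, here is a hedged sketch using `KMeans` on six made-up points. The choice of `KMeans`, two clusters and the points themselves are my assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points: one near (1, 1) and one near (8, 8)
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Same pattern as the supervised models: declare the model, then fit it.
# There is no label (Y) because clustering is unsupervised.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=7).fit(points)

labels = clusterer.labels_  # the cluster assigned to each point
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a number.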

### Regression

First let's import some libraries

In [59]:
from sklearn.linear_model import LinearRegression ## The model
from sklearn.model_selection import KFold, cross_val_score


We will try to predict the total sales value of a product on a given day!

In [60]:
data = pd.DataFrame(df.groupby(['day', 'product_id'])['sales_value'].sum()).reset_index()

In [61]:
data.head()

Unnamed: 0,day,product_id,sales_value
0,1,820162,1.79
1,1,822346,1.25
2,1,824399,1.98
3,1,826249,1.98
4,1,826784,0.99


Now let's set up cross validation, which splits the data into 10 folds and tests the model on each fold in turn!


In [62]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)

In [63]:
model = LinearRegression() ## Declare the model

Now we want to check the performance of the model! For a regression technique I am using R Squared, which indicates how well a set of predictions fits the actual values (1 is a perfect fit; values near 0 mean the model explains almost none of the variation).

In [66]:
X = data.drop('sales_value', axis=1)
Y = data['sales_value']

In [67]:
scoring = 'r2'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()))

R^2: 0.002 (0.000)


As you can see this is very poor! The model isn't fitting the data very well. We can try to improve this by altering the parameters, changing the model or giving it more useful features!
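For contrast, here is a sketch of the same cross-validation steps on made-up data where the target really is (almost) a straight line of the input. The data, the slope of 3 and the noise level are assumptions purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up data: y is roughly 3 * x + 2, plus a little random noise
rng = np.random.default_rng(7)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 3.0 * X_toy[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

kfold = KFold(n_splits=10, random_state=7, shuffle=True)
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=kfold, scoring='r2')

# One R^2 score per fold; report the mean and the spread
print("R^2: %.3f (%.3f)" % (scores.mean(), scores.std()))
```

Here the score should come out close to 1, because a linear model is a good match for how the data was generated.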

### Classification


We have focused primarily on a Regression technique, however you can also use different algorithms that are made for classification! To do this I am just going to randomly assign a 1 or 0 label to each row of the data!

In [68]:
df['label'] = np.random.choice([1, 0], df.shape[0])

In [69]:
class_df = df[['day', 'product_id', 'label']]

With this type of ML I am going to split the data into a train and test set! X is the data and Y is the label

In [72]:
from sklearn.model_selection import train_test_split

In [73]:
X_train, X_test, y_train, y_test = train_test_split(class_df.drop('label', axis=1), 
                                                    class_df['label'], 
                                                   test_size =0.33)

In [75]:
y_train.head()

1400140    0
258393     1
1425158    1
619771     0
1893106    0
Name: label, dtype: int64

Let's train a model! For this we are going to use the `fit()` function to fit the model to the training data

In [76]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
model = DecisionTreeClassifier().fit(X_train, y_train)

Now we use that model to predict on the testing data

In [77]:
predictions = model.predict(X_test)
predictions

array([0, 0, 1, ..., 0, 1, 0])

In [78]:
model.predict_proba(X_test)

array([[1.  , 0.  ],
       [1.  , 0.  ],
       [0.  , 1.  ],
       ...,
       [0.75, 0.25],
       [0.  , 1.  ],
       [0.5 , 0.5 ]])

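To test the performance we can use a `sklearn` function called `classification_report()`. This shows the precision, recall and F1 score for each label supplied, plus the overall accuracy! Because the labels here were assigned at random, scores of around 0.50 are what we should expect.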

In [79]:
print(classification_report(y_true = y_test.to_numpy(), y_pred = predictions))

              precision    recall  f1-score   support

           0       0.50      0.54      0.52    324199
           1       0.50      0.46      0.48    324743

    accuracy                           0.50    648942
   macro avg       0.50      0.50      0.50    648942
weighted avg       0.50      0.50      0.50    648942
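To show what those columns mean, here is a sketch that works the numbers out by hand for label 1 on six made-up predictions (plain Python, no `sklearn` needed):

```python
# Made-up true labels and predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, really 1
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, really 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, really 1

precision = tp / (tp + fp)  # of everything predicted as 1, how much was right
recall = tp / (tp + fn)     # of everything that really is 1, how much was found
f1 = 2 * precision * recall / (precision + recall)  # balance of the two
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The `support` column in the report is simply the number of true rows for each label.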



To look more into `sklearn` and its capabilities go here https://scikit-learn.org/

## Plotting Data
Visualising your data is very important. I find R best for this, but there are several Python libraries for it too:
* `matplotlib`
* `seaborn` 
* `plotly`

Today I will take you through `matplotlib` and `seaborn`. For `matplotlib` you need to run `%matplotlib inline` to show the plots within the notebook. You can also import just part of a library (e.g. `matplotlib.pyplot`) if you don't want to import everything.

Additionally, Orion has some plotting functionality to make your plots Peak themed! However, this is not needed and you can easily have the charts look how you want!

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# From orion the colours and fonts! 
from orion.contrib.peak.style import install_peak_fonts, apply
from orion.contrib.peak.style import PEAK_COLORS_1, PEAK_COLORS_2, PEAK_COLORS_3

# This loads them up!
install_peak_fonts()
apply()

The most common functions within `matplotlib` are:
* `plt.hist()` - Histogram
* `plt.bar()` - Bar chart
* `plt.plot()` - Line chart

In [None]:
## This doesn't look great!
plt.plot(transactions.groupby('day')['quantity'].sum())

In [None]:
fig, ax = plt.subplots()

plt.plot(
    transactions.groupby('day')['quantity'].sum(), 
    #color = PEAK_COLORS_1[3]
)

plt.xlabel('Day', size=10)
plt.ylabel('Quantity', size=10)
#plt.title('Quantity over time', size=10)
ax.tick_params(axis='both', which='major', labelsize=8)
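The same kind of chart can also be built entirely through the `ax` object, which can be easier to follow once you have several charts. A minimal sketch on made-up numbers (the values and the figure size are just for illustration):

```python
import matplotlib.pyplot as plt

# Made-up values for illustration
days = [1, 2, 3, 4, 5]
quantities = [10, 14, 9, 20, 17]

fig, ax = plt.subplots(figsize=(6, 3))  # figsize is width, height in inches

ax.plot(days, quantities)               # line chart drawn on this axis
ax.set_xlabel('Day', size=10)
ax.set_ylabel('Quantity', size=10)
ax.set_title('Quantity over time', size=10)
ax.tick_params(axis='both', which='major', labelsize=8)
```

`plt.xlabel()` and `ax.set_xlabel()` do the same job; the `ax` versions make it explicit which chart you are changing.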

`seaborn` is very similar! The most common charts are:
* `sns.lineplot()`
* `sns.barplot()`
* `sns.distplot()` (deprecated since seaborn 0.11; newer code uses `sns.histplot()` or `sns.displot()`)

So if we wanted to create the same plot as above we use:

In [None]:
fig, ax = plt.subplots()

# Replace plt.plot() with the seaborn equivalent
sns.lineplot( data = transactions.groupby('day')['quantity'].sum(), 
            color = PEAK_COLORS_1[3])


plt.xlabel('Day', size=10)
plt.ylabel('Quantity', size=10)
plt.title('Quantity over time', size=10)
ax.tick_params(axis='both', which='major', labelsize=8)

You may also wish to show data within two groups, eg department and quantity over time. 

In [None]:
data_to_plot = pd.DataFrame(df.groupby(['day', 'department'])['quantity'].sum()).reset_index()

In [None]:
data_to_plot.head()

Sometimes plotting all of the data can look unreadable! 

In [None]:
sns.lineplot(data = data_to_plot,
             x = 'day', 
             y = 'quantity', 
            hue = 'department')

Let's filter some out! Some functions I am using here:
* `unique()` - Get the distinct values in that list
* `isin()` - Only select the data I want

In [None]:
data_to_plot = data_to_plot[data_to_plot['day'] <= 30]

In [None]:
departments = data_to_plot.sort_values('quantity', ascending=False)['department'].unique()[0:9]
data_to_plot = data_to_plot[data_to_plot['department'].isin(departments)]
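To see what those two functions do step by step, here is a sketch on a made-up table (`toy` is illustrative, and I keep the top two departments rather than nine):

```python
import pandas as pd

toy = pd.DataFrame({'department': ['GROCERY', 'DRUG GM', 'GROCERY', 'PRODUCE', 'DRUG GM'],
                    'quantity': [50, 5, 40, 30, 8]})

# Sort so the biggest quantities come first, then take the distinct departments.
# unique() keeps the order in which values first appear.
ordered = toy.sort_values('quantity', ascending=False)['department'].unique()

top_two = ordered[0:2]

# isin() is True for rows whose department is in the list, so only those rows are kept
filtered = toy[toy['department'].isin(top_two)]
print(list(top_two), len(filtered))
```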

In [None]:
fig, ax = plt.subplots()

sns.lineplot(data = data_to_plot,
             x = 'day', 
             y = 'quantity', 
            hue = 'department')

plt.xlabel('Day', size=10)
plt.ylabel('Quantity', size=10)
plt.title('Quantity over time', size=10)
ax.tick_params(axis='both', which='major', labelsize=8)

In [None]:
fig, ax = plt.subplots()

sns.lineplot(data = data_to_plot,
             x = 'day', 
             y = 'quantity', 
            hue = 'department')

plt.xlabel('Day', size=10)
plt.ylabel('Quantity', size=10)
plt.title('Quantity over time', size=10)

ax.legend(loc = 'best',
          fontsize = 5, 
          title = 'Department', 
          title_fontsize =9 )


ax.tick_params(axis='both', which='major', labelsize=8)
