# From Excel to Python

![image.png](attachment:image.png)

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.float_format', '{:,.2f}'.format)
file = "data/ecommerce.csv"
df = pd.read_csv(file, decimal=',')

## Approaching Your Dataset for the First Time

Explore your dataset quickly with pandas before you start your analysis:
- Check what your dataset looks like
- Find out how many rows and columns it has
- List all the column names
- Check the data type of each column
- etc.
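
As a rough illustration of the checklist above, here is a minimal sketch on a tiny made-up table. The name `toy` and its values are hypothetical (they are not rows from the workshop file); the same calls should work on `df`.

```python
import pandas as pd

# Hypothetical three-row table, used only to illustrate the calls
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2"],
    "Quantity": [6, 8, -2],
    "UnitPrice": ["2.55", "3.39", "1.25"],  # stored as text on purpose
})

print(toy.head(2))          # first rows: what does the data look like?
n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(list(toy.columns))    # all column names
print(toy.dtypes)           # data type of each column
toy.info()                  # compact summary: dtypes plus non-null counts
```

Note how `UnitPrice` shows up as `object` (text) rather than a number, which is the same situation the workshop data is in.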

In [2]:
# What data do you have? Print your dataframe
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.00,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.00,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.00,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.00,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.00,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.00,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.1,12680.00,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.00,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.00,France


In [3]:
# Total number of rows & columns
# hint, we can use the .shape property, which returns (rows, columns)
df.shape

(541909, 8)

In [4]:
# All the column names
# hint, we can use columns property
df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [6]:
# What data types do the columns have? 
# hint, we can use dtypes property
df.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice       object
CustomerID     float64
Country         object
dtype: object

In [7]:
# describe() is used to generate descriptive statistics of the data in a Pandas DataFrame or Series. 
# It summarizes central tendency and dispersion of the dataset. 
# describe() helps in getting a quick overview of the dataset. 
df.describe()

Unnamed: 0,Quantity,CustomerID
count,541909.0,406829.0
mean,9.55,15287.69
std,218.08,1713.6
min,-80995.0,12346.0
25%,1.0,13953.0
50%,3.0,15152.0
75%,10.0,16791.0
max,80995.0,18287.0


From the quick statistical overview above, does everything make sense?
Should we explore a few things that bother us about the data 👆?

> 💡 Hint: there are rows where the Quantity is negative
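
One possible way to follow up on this hint is to count the negative rows and check how many of them have an invoice number starting with `C`. This is only a sketch on a hypothetical `toy` table; reading the `C` prefix as "cancellation" is an assumption, not something the dataset states.

```python
import pandas as pd

# Hypothetical rows, loosely modeled on the kind of values seen in the data
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536757", "556690", "536370"],
    "Quantity": [6, -9360, -9600, 48],
})

is_negative = toy["Quantity"] < 0                     # True/False per row
starts_with_c = toy["InvoiceNo"].str.startswith("C")  # True/False per row

n_negative = int(is_negative.sum())                   # True counts as 1
n_negative_without_c = int((is_negative & ~starts_with_c).sum())
print(n_negative, n_negative_without_c)
```

If the second number is not zero, some negative quantities come from something other than `C` invoices and deserve a closer look.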

In [9]:
# Let's sort the values by Quantity
# hint, we can use #sort_values(by="Column Name")
df.sort_values(by="Quantity")

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
540422,C581484,23843,"PAPER CRAFT , LITTLE BIRDIE",-80995,12/9/2011 9:27,2.08,16446.00,United Kingdom
61624,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,1/18/2011 10:17,1.04,12346.00,United Kingdom
225529,556690,23005,printing smudges/thrown away,-9600,6/14/2011 10:37,0,,United Kingdom
225530,556691,23005,printing smudges/thrown away,-9600,6/14/2011 10:37,0,,United Kingdom
4287,C536757,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,-9360,12/2/2010 14:23,0.03,15838.00,United Kingdom
...,...,...,...,...,...,...,...,...
421632,573008,84077,WORLD WAR 2 GLIDERS ASSTD DESIGNS,4800,10/27/2011 12:26,0.21,12901.00,United Kingdom
74614,542504,37413,,5568,1/28/2011 12:03,0,,United Kingdom
502122,578841,84826,ASSTD DESIGN 3D PAPER STICKERS,12540,11/25/2011 15:57,0,13256.00,United Kingdom
61619,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,1/18/2011 10:01,1.04,12346.00,United Kingdom


### Filtering data

How to see data from only 1 or a few selected columns?

In [11]:
# See only 1 column
# hint, access the columns of the DataFrame using a square bracket notation -> []
# i.e. df['Column 1']
# hint #2, you can also access a single column of a DataFrame using the `dot notation`
# as long as the column name is a valid Python name (no spaces) and does not clash with a DataFrame method
# i.e. df.Sales to access a column called `Sales`

# df['Country']
df.Country


0         United Kingdom
1         United Kingdom
2         United Kingdom
3         United Kingdom
4         United Kingdom
               ...      
541904            France
541905            France
541906            France
541907            France
541908            France
Name: Country, Length: 541909, dtype: object

#### Aggregate Functions in Python

By selecting only 1 column like we did above 👆, we turn a DataFrame into a `Series`<br>
We can run aggregate functions on a Series, just like we would in Excel<br>
`=SUM(A1:A10)`

> `SUM`, `MEAN`, `MIN`, `MAX`, `COUNT` ..?
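
Here is a minimal sketch of those aggregate functions on a small hypothetical Series called `quantity`, with the closest Excel formula in each comment:

```python
import pandas as pd

# Hypothetical values, only for illustration
quantity = pd.Series([6, 6, 8, 2], name="Quantity")

total = quantity.sum()        # like =SUM(A1:A4)
average = quantity.mean()     # like =AVERAGE(A1:A4)
smallest = quantity.min()     # like =MIN(A1:A4)
largest = quantity.max()      # like =MAX(A1:A4)
n_values = quantity.count()   # number of non-missing values

print(total, average, smallest, largest, n_values)
```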

In [18]:
# How to use aggregation functions in Python?
# Hint, just like in Excel, `SUM` and `MEAN` only make sense on a column with numerical values/dtypes,
# while `COUNT` works on any column (it counts the non-missing values)

df.Country.count()



541909

Can I quickly see or count unique values of a certain column?

In [20]:
# we can drop the duplicated values first using #drop_duplicates() and then #count()
df.Country.drop_duplicates().count()


38

In [21]:
# Or we can also use #nunique() directly on the Series

df.Country.nunique()

38

We can also quickly count the rows per value

In [22]:
# let's try using #value_counts()
df.Country.value_counts()

United Kingdom          495478
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
USA                        291
Hong Kong                  288
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58
Lebanon 

#### Selecting Multiple Columns & Rows

In [24]:
# See multiple columns
# hint, we can pass a python list of column names into the square bracket
# recap, a Python list is wrapped inside a square bracket like this:
# ['Column 1', 'Column 2']

selected_headers = ['StockCode', 'Description', 'UnitPrice']
df[selected_headers].drop_duplicates()

df[['StockCode', 'Description', 'UnitPrice']]

Unnamed: 0,StockCode,Description,UnitPrice
0,85123A,WHITE HANGING HEART T-LIGHT HOLDER,2.55
1,71053,WHITE METAL LANTERN,3.39
2,84406B,CREAM CUPID HEARTS COAT HANGER,2.75
3,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,3.39
4,84029E,RED WOOLLY HOTTIE WHITE HEART.,3.39
...,...,...,...
540908,DOT,DOTCOM POSTAGE,933.17
541297,22738,RIBBON REEL SNOWY VILLAGE,7.5
541540,DOT,DOTCOM POSTAGE,1714.17
541541,M,Manual,224.69


..and also limiting the number of rows by filtering them

In [25]:
# simple filter from rows xx to yy
# hint, we can pass a range of row numbers into the square bracket, i.e. 0:10 (the end row is not included)

df[31:40]


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
31,536370,10002,INFLATABLE POLITICAL GLOBE,48,12/1/2010 8:45,0.85,12583.0,France
32,536370,21791,VINTAGE HEADS AND TAILS CARD GAME,24,12/1/2010 8:45,1.25,12583.0,France
33,536370,21035,SET/2 RED RETROSPOT TEA TOWELS,18,12/1/2010 8:45,2.95,12583.0,France
34,536370,22326,ROUND SNACK BOXES SET OF4 WOODLAND,24,12/1/2010 8:45,2.95,12583.0,France
35,536370,22629,SPACEBOY LUNCH BOX,24,12/1/2010 8:45,1.95,12583.0,France
36,536370,22659,LUNCH BOX I LOVE LONDON,24,12/1/2010 8:45,1.95,12583.0,France
37,536370,22631,CIRCUS PARADE LUNCH BOX,24,12/1/2010 8:45,1.95,12583.0,France
38,536370,22661,CHARLOTTE BAG DOLLY GIRL DESIGN,20,12/1/2010 8:45,0.85,12583.0,France
39,536370,21731,RED TOADSTOOL LED NIGHT LIGHT,24,12/1/2010 8:45,1.65,12583.0,France


In [26]:
# filter rows & columns with #loc
# df.loc[row(s), column(s)]
# note: with #loc the end label is included, so 31:40 returns rows 31 to 40

df.loc[31:40, selected_headers]


Unnamed: 0,StockCode,Description,UnitPrice
31,10002,INFLATABLE POLITICAL GLOBE,0.85
32,21791,VINTAGE HEADS AND TAILS CARD GAME,1.25
33,21035,SET/2 RED RETROSPOT TEA TOWELS,2.95
34,22326,ROUND SNACK BOXES SET OF4 WOODLAND,2.95
35,22629,SPACEBOY LUNCH BOX,1.95
36,22659,LUNCH BOX I LOVE LONDON,1.95
37,22631,CIRCUS PARADE LUNCH BOX,1.95
38,22661,CHARLOTTE BAG DOLLY GIRL DESIGN,0.85
39,21731,RED TOADSTOOL LED NIGHT LIGHT,1.65
40,22900,SET 2 TEA TOWELS I LOVE LONDON,2.95
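
The two outputs above differ by one row: `df[31:40]` stops at row 39, while `df.loc[31:40, ...]` includes row 40. A small sketch on a hypothetical `toy` table shows the difference between plain slicing, label-based `loc` and position-based `iloc`:

```python
import pandas as pd

# Hypothetical six-row table with the default index 0..5
toy = pd.DataFrame({
    "StockCode": list("abcdef"),
    "UnitPrice": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

by_slice = toy[1:4]                     # positions 1, 2, 3 (end excluded)
by_label = toy.loc[1:4, ["StockCode"]]  # labels 1 to 4 (end included)
by_position = toy.iloc[1:4, [0]]        # positions 1, 2, 3 (end excluded)

print(len(by_slice), len(by_label), len(by_position))
```

With the default index, labels and positions are the same numbers, so the only visible difference is whether the end row is included.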


#### Filter rows by adding conditions, like in Excel ⏚ 

> 💡 This is called Boolean filtering in pandas

![image.png](attachment:image.png)

<br><br>
#### E.g. if a column is equal to or contains certain text


> hint, we can use double equal sign == to compare exact text to the column value


> hint 2, we can use a `str` (string) function such as #startswith, #endswith, or #contains if it's not an exact match<br>
df['Column 1'].str.contains('some text')
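
To make the hints above concrete, here is a minimal sketch on a hypothetical `description` Series. The `case=False` and `na=False` arguments are optional extras: the first ignores upper/lower case, the second treats missing text as "no match".

```python
import pandas as pd

# Hypothetical product descriptions, including one missing value
description = pd.Series([
    "WHITE METAL LANTERN",
    "Lunch Box I Love London",
    None,
    "RED TOADSTOOL LED NIGHT LIGHT",
])

exact = description == "WHITE METAL LANTERN"                        # exact match
starts = description.str.startswith("RED", na=False)                # text begins with "RED"
contains = description.str.contains("lunch", case=False, na=False)  # text contains "lunch"

print(description[contains])  # use the True/False Series as a filter
```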

In [29]:
# 1. Create the True/False condition -> df['Column 1'].str.contains('some text') -> Print and see what happens
# 2. Use this condition inside the DataFrame's square brackets to keep only the matching rows (all columns)

boolean_saudi_arabia = df.Country == 'Saudi Arabia'

df[boolean_saudi_arabia]


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
100810,544838,22915,ASSORTED BOTTLE TOP MAGNETS,12,2/24/2011 10:34,0.42,12565.0,Saudi Arabia
100811,544838,22363,GLASS JAR MARMALADE,6,2/24/2011 10:34,2.95,12565.0,Saudi Arabia
100812,544838,22362,GLASS JAR PEACOCK BATH SALTS,6,2/24/2011 10:34,2.95,12565.0,Saudi Arabia
100813,544838,22361,GLASS JAR DAISY FRESH COTTON WOOL,6,2/24/2011 10:34,2.95,12565.0,Saudi Arabia
100814,544838,22553,PLASTERS IN TIN SKULLS,12,2/24/2011 10:34,1.65,12565.0,Saudi Arabia
100815,544838,22555,PLASTERS IN TIN STRONGMAN,12,2/24/2011 10:34,1.65,12565.0,Saudi Arabia
100816,544838,22556,PLASTERS IN TIN CIRCUS PARADE,12,2/24/2011 10:34,1.65,12565.0,Saudi Arabia
100817,544838,20781,GOLD EAR MUFF HEADPHONES,2,2/24/2011 10:34,5.49,12565.0,Saudi Arabia
100818,544838,22969,HOMEMADE JAM SCENTED CANDLES,12,2/24/2011 10:34,1.45,12565.0,Saudi Arabia
108127,C545507,22361,GLASS JAR DAISY FRESH COTTON WOOL,-5,3/3/2011 11:43,2.95,12565.0,Saudi Arabia


We can also filter a numeric column with a range of numbers<br>
i.e. Column 1 > 100 or Column 2 < 50
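
Conditions can also be combined. The sketch below uses a hypothetical `toy` table and shows `&` (AND), `|` (OR), `~` (NOT) and `between`, which is a shortcut for a numeric range:

```python
import pandas as pd

# Hypothetical rows, only for illustration
toy = pd.DataFrame({
    "Country": ["France", "Spain", "France", "Italy"],
    "Quantity": [3, 12, 48, 7],
})

is_france = toy["Country"] == "France"
more_than_5 = toy["Quantity"] > 5

both = toy[is_france & more_than_5]             # AND: both conditions are True
either = toy[is_france | more_than_5]           # OR: at least one is True
not_france = toy[~is_france]                    # NOT: flips True and False
in_range = toy[toy["Quantity"].between(5, 20)]  # 5 <= Quantity <= 20, both ends included

print(both)
```

When writing the conditions inline, wrap each one in parentheses, e.g. `toy[(toy.Quantity > 5) & (toy.Country == "France")]`.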

In [31]:
# and follow the 2 steps in the previous cell
boolean_more_than_5 = df.Quantity > 5
df[boolean_more_than_5].sort_values(by="Quantity")

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.00,United Kingdom
232885,557350,20914,SET/5 RED RETROSPOT LID GLASS BOWLS,6,6/20/2011 10:41,2.95,16670.00,United Kingdom
232887,557350,22807,SET OF 6 T-LIGHTS TOADSTOOLS,6,6/20/2011 10:41,2.95,16670.00,United Kingdom
232896,557389,84978,HANGING HEART JAR T-LIGHT HOLDER,6,6/20/2011 11:03,1.25,15921.00,United Kingdom
232901,557389,23049,RECYCLED ACAPULCO MAT RED,6,6/20/2011 11:03,8.25,15921.00,United Kingdom
...,...,...,...,...,...,...,...,...
421632,573008,84077,WORLD WAR 2 GLIDERS ASSTD DESIGNS,4800,10/27/2011 12:26,0.21,12901.00,United Kingdom
74614,542504,37413,,5568,1/28/2011 12:03,0,,United Kingdom
502122,578841,84826,ASSTD DESIGN 3D PAPER STICKERS,12540,11/25/2011 15:57,0,13256.00,United Kingdom
61619,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,1/18/2011 10:01,1.04,12346.00,United Kingdom


In [37]:

# Combine two conditions with & (AND); note that only the last expression in a cell is displayed
df[boolean_saudi_arabia & boolean_more_than_5].tail()

df.tail(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541899,581587,22726,ALARM CLOCK BAKELIKE GREEN,4,12/9/2011 12:50,3.75,12680.0,France
541900,581587,22730,ALARM CLOCK BAKELIKE IVORY,4,12/9/2011 12:50,3.75,12680.0,France
541901,581587,22367,CHILDRENS APRON SPACEBOY DESIGN,8,12/9/2011 12:50,1.95,12680.0,France
541902,581587,22629,SPACEBOY LUNCH BOX,12,12/9/2011 12:50,1.95,12680.0,France
541903,581587,23256,CHILDRENS CUTLERY SPACEBOY,4,12/9/2011 12:50,4.15,12680.0,France
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,12/9/2011 12:50,4.95,12680.0,France


<br><br>
***

## Cleaning and Manipulating Data

### String Manipulations

To access the strings contained in a column, we use `.str`<br>
Examples:
* `.str.upper` -> convert to uppercase, `=UPPER()` in Excel
* `.str.lower` -> convert to lowercase, `=LOWER()` in Excel
* `.str.title` -> capitalize the first letter of each word, `=PROPER()` in Excel
* `.str.replace` -> replace some characters with new ones, `=SUBSTITUTE()` or `=REPLACE()` in Excel
* `.str.strip`, `.str.lstrip`, `.str.rstrip` -> remove white space at the beginning and/or end of the text, similar to `=TRIM()` in Excel
* `.str.cat(sep=",")` -> concatenate (join) strings, `=CONCAT()` in Excel
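
A quick sketch of the methods listed above, applied to a hypothetical two-item Series called `text`:

```python
import pandas as pd

# Hypothetical text values; the first has extra white space around it
text = pd.Series(["  white metal lantern ", "SET/2 TEA TOWELS"])

stripped = text.str.strip()                                # remove surrounding white space
upper = stripped.str.upper()                               # all uppercase
title = stripped.str.title()                               # first letter of each word
replaced = stripped.str.replace("/", " of ", regex=False)  # plain text replacement
joined = stripped.str.cat(sep=", ")                        # one string from all rows

print(title.tolist())
print(joined)
```

Each `.str` method returns a new Series, so the result has to be assigned back to the column to keep it.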


In [43]:
# Let's convert the Description column to title case

df.Description = df.Description.str.title()

In [44]:
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,White Hanging Heart T-Light Holder,6,12/1/2010 8:26,2.55,17850.00,United Kingdom
1,536365,71053,White Metal Lantern,6,12/1/2010 8:26,3.39,17850.00,United Kingdom
2,536365,84406B,Cream Cupid Hearts Coat Hanger,8,12/1/2010 8:26,2.75,17850.00,United Kingdom
3,536365,84029G,Knitted Union Flag Hot Water Bottle,6,12/1/2010 8:26,3.39,17850.00,United Kingdom
4,536365,84029E,Red Woolly Hottie White Heart.,6,12/1/2010 8:26,3.39,17850.00,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,Pack Of 20 Spaceboy Napkins,12,12/9/2011 12:50,0.85,12680.00,France
541905,581587,22899,Children'S Apron Dolly Girl,6,12/9/2011 12:50,2.1,12680.00,France
541906,581587,23254,Childrens Cutlery Dolly Girl,4,12/9/2011 12:50,4.15,12680.00,France
541907,581587,23255,Childrens Cutlery Circus Parade,4,12/9/2011 12:50,4.15,12680.00,France


### Modifying Data Types

Unit price seems like something that should be numeric

Let's convert the data type of unit price to float, so we can do some calculations with it 🤔

CustomerID, on the other hand, can be converted to a non-numerical type, because we don't need to do any calculations with it

InvoiceDate should also be a date or time data type

👇👇👇

In [46]:
# Let's fix the data types using #astype function
# Let's first fix the unit price and the CustomerID
# ⭐️ We have to reassign the changes back to the DataFrame so the changes we made are saved

df.UnitPrice = df.UnitPrice.astype('float')


In [51]:
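# CustomerID has missing values, and a column with missing values cannot be converted to int directly,
# so we first fill them with the number 0
df.CustomerID = df.CustomerID.fillna(0).astype('int')
df.CustomerID = df.CustomerID.astype('object')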
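
As an aside, here is a sketch of two alternatives that may be handy when a column is messier than ours. This is not the approach used in this workshop: `pd.to_numeric(..., errors="coerce")` and the nullable `"Int64"` dtype are shown on a hypothetical `toy` table.

```python
import pandas as pd

# Hypothetical table: one unparseable price and one missing customer
toy = pd.DataFrame({
    "UnitPrice": ["2.55", "3.39", "n/a"],
    "CustomerID": [17850.0, None, 12583.0],
})

# errors="coerce" turns text that cannot be parsed into NaN instead of raising an error
toy["UnitPrice"] = pd.to_numeric(toy["UnitPrice"], errors="coerce")

# The nullable "Int64" dtype stores whole numbers and keeps missing values as <NA>
toy["CustomerID"] = toy["CustomerID"].astype("Int64")

print(toy.dtypes)
print(toy)
```

The trade-off: missing customers stay visibly missing instead of becoming `0`, so later filters can use `notna()`.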

In [55]:
df.describe()


Unnamed: 0,Quantity,UnitPrice
count,541909.0,541909.0
mean,9.55,4.61
std,218.08,96.76
min,-80995.0,-11062.06
25%,1.0,1.25
50%,3.0,2.08
75%,10.0,4.13
max,80995.0,38970.0


### Adding a new "calculated" column is super simple ⭐️

In [56]:
df['Amount'] = df.UnitPrice * df.Quantity
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Amount
0,536365,85123A,White Hanging Heart T-Light Holder,6,12/1/2010 8:26,2.55,17850,United Kingdom,15.30
1,536365,71053,White Metal Lantern,6,12/1/2010 8:26,3.39,17850,United Kingdom,20.34
2,536365,84406B,Cream Cupid Hearts Coat Hanger,8,12/1/2010 8:26,2.75,17850,United Kingdom,22.00
3,536365,84029G,Knitted Union Flag Hot Water Bottle,6,12/1/2010 8:26,3.39,17850,United Kingdom,20.34
4,536365,84029E,Red Woolly Hottie White Heart.,6,12/1/2010 8:26,3.39,17850,United Kingdom,20.34
...,...,...,...,...,...,...,...,...,...
541904,581587,22613,Pack Of 20 Spaceboy Napkins,12,12/9/2011 12:50,0.85,12680,France,10.20
541905,581587,22899,Children'S Apron Dolly Girl,6,12/9/2011 12:50,2.10,12680,France,12.60
541906,581587,23254,Childrens Cutlery Dolly Girl,4,12/9/2011 12:50,4.15,12680,France,16.60
541907,581587,23255,Childrens Cutlery Circus Parade,4,12/9/2011 12:50,4.15,12680,France,16.60


In [None]:
# Now that we have converted UnitPrice to numeric type, we can use it to calculate the invoice amount
# and create a new column for the invoice amount




### Or adding a new column based on conditional grouping
Grouping the values based on a condition with `np.where` (just like the `IF` statement in Excel)

In [57]:
# use np.where(CONDITION, VALUE IF TRUE, VALUE IF FALSE) to create a new column with a condition

df['Volume'] = np.where(df.Quantity > 500, 'High', 'Low')
df


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Amount,Volume
0,536365,85123A,White Hanging Heart T-Light Holder,6,12/1/2010 8:26,2.55,17850,United Kingdom,15.30,Low
1,536365,71053,White Metal Lantern,6,12/1/2010 8:26,3.39,17850,United Kingdom,20.34,Low
2,536365,84406B,Cream Cupid Hearts Coat Hanger,8,12/1/2010 8:26,2.75,17850,United Kingdom,22.00,Low
3,536365,84029G,Knitted Union Flag Hot Water Bottle,6,12/1/2010 8:26,3.39,17850,United Kingdom,20.34,Low
4,536365,84029E,Red Woolly Hottie White Heart.,6,12/1/2010 8:26,3.39,17850,United Kingdom,20.34,Low
...,...,...,...,...,...,...,...,...,...,...
541904,581587,22613,Pack Of 20 Spaceboy Napkins,12,12/9/2011 12:50,0.85,12680,France,10.20,Low
541905,581587,22899,Children'S Apron Dolly Girl,6,12/9/2011 12:50,2.10,12680,France,12.60,Low
541906,581587,23254,Childrens Cutlery Dolly Girl,4,12/9/2011 12:50,4.15,12680,France,16.60,Low
541907,581587,23255,Childrens Cutlery Circus Parade,4,12/9/2011 12:50,4.15,12680,France,16.60,Low


In [58]:
df.Volume.value_counts()

Low     541476
High       433
Name: Volume, dtype: int64

#### Advanced grouping to replace nested IFs :)
If you find yourself doing nested `IF`s in an Excel column:<br>
i.e. -> `=IF(A1<10, "Bad", IF(AND(A1>=10, A1<50), "OK", "Good"))`<br>
...then we can use `np.select` to group the values with multiple conditions


In [59]:
# Let's apply this to the Quantity column, and overwrite the 'Volume' column with 4 groups: Low, Medium, High or Supeeeeeer High
# We need to create 2 lists:

# 1. A `list` for all the conditions
conditions = [
    (df.Quantity <= 50),
    (df.Quantity > 50) & (df.Quantity <= 100),
    (df.Quantity > 100) & (df.Quantity <= 500),
    (df.Quantity > 500)
]

# 2. A `list` for the corresponding values
values = [
    'Low', 'Medium', 'High', 'Supeeeeeer High'
]



In [61]:
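# ..and then we use np.select(conditions, values) on the DataFrame
# `default` is the value used when no condition matches; we pass a text default because
# recent NumPy versions may reject the numeric default (0) next to text values
df.Volume = np.select(conditions, values, default='Unknown')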
df.Volume.value_counts()

Low                529591
Medium               7368
High                 4517
Supeeeeeer High       433
Name: Volume, dtype: int64
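
When the conditions are simply consecutive numeric ranges, `pd.cut` is a possible alternative to `np.select`. The sketch below is not part of the original workshop flow; it uses a hypothetical `quantity` Series and the same boundaries (50, 100, 500) as the conditions above.

```python
import numpy as np
import pandas as pd

# Hypothetical quantities chosen to sit right on the boundaries
quantity = pd.Series([-5, 50, 51, 100, 500, 501], name="Quantity")

volume = pd.cut(
    quantity,
    bins=[-np.inf, 50, 100, 500, np.inf],  # each bin includes its right edge by default
    labels=["Low", "Medium", "High", "Supeeeeeer High"],
)

print(volume.tolist())
```

The result is a categorical column whose groups keep their order (Low before Medium, and so on), which can be convenient for sorting.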

#### Group values by year or month, so we can analyze some trend! 📆

In [63]:
# Let's add 2 new columns:
# 'Year' (YYYY) and 'Month' (YYYYMM)
import datetime as dt # optional: the pandas `.dt` accessor used below works without this import

# 1. We have to convert the InvoiceDate column to a datetime data type using `pd.to_datetime(date column)`

df.InvoiceDate = pd.to_datetime(df.InvoiceDate)
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID             object
Country                object
Amount                float64
Volume                 object
dtype: object

In [64]:
# 2. Then we can call the pandas `.dt` accessor on the datetime column (no extra import needed) and use
# its built-in date time properties and functions, such as extracting the year, month, day, etc
df['Year'] = df.InvoiceDate.dt.year
df


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Amount,Volume,Year
0,536365,85123A,White Hanging Heart T-Light Holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.30,Low,2010
1,536365,71053,White Metal Lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Low,2010
2,536365,84406B,Cream Cupid Hearts Coat Hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.00,Low,2010
3,536365,84029G,Knitted Union Flag Hot Water Bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Low,2010
4,536365,84029E,Red Woolly Hottie White Heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Low,2010
...,...,...,...,...,...,...,...,...,...,...,...
541904,581587,22613,Pack Of 20 Spaceboy Napkins,12,2011-12-09 12:50:00,0.85,12680,France,10.20,Low,2011
541905,581587,22899,Children'S Apron Dolly Girl,6,2011-12-09 12:50:00,2.10,12680,France,12.60,Low,2011
541906,581587,23254,Childrens Cutlery Dolly Girl,4,2011-12-09 12:50:00,4.15,12680,France,16.60,Low,2011
541907,581587,23255,Childrens Cutlery Circus Parade,4,2011-12-09 12:50:00,4.15,12680,France,16.60,Low,2011


In [65]:
df['Month'] = df.InvoiceDate.dt.strftime('%Y%m') # YYYYMM
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Amount,Volume,Year,Month
0,536365,85123A,White Hanging Heart T-Light Holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.30,Low,2010,201012
1,536365,71053,White Metal Lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Low,2010,201012
2,536365,84406B,Cream Cupid Hearts Coat Hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.00,Low,2010,201012
3,536365,84029G,Knitted Union Flag Hot Water Bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Low,2010,201012
4,536365,84029E,Red Woolly Hottie White Heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Low,2010,201012
...,...,...,...,...,...,...,...,...,...,...,...,...
541904,581587,22613,Pack Of 20 Spaceboy Napkins,12,2011-12-09 12:50:00,0.85,12680,France,10.20,Low,2011,201112
541905,581587,22899,Children'S Apron Dolly Girl,6,2011-12-09 12:50:00,2.10,12680,France,12.60,Low,2011,201112
541906,581587,23254,Childrens Cutlery Dolly Girl,4,2011-12-09 12:50:00,4.15,12680,France,16.60,Low,2011,201112
541907,581587,23255,Childrens Cutlery Circus Parade,4,2011-12-09 12:50:00,4.15,12680,France,16.60,Low,2011,201112
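
Here is a compact sketch of the same date steps on a hypothetical `raw` Series. Passing an explicit `format` is optional; it is included here as an assumption about the file (month first, then day), so pandas does not have to guess. `dt.to_period("M")` is shown as another way to get a month value.

```python
import pandas as pd

# Hypothetical date strings in the same style as InvoiceDate
raw = pd.Series(["12/1/2010 8:26", "1/18/2011 10:17", "12/9/2011 12:50"])

# An explicit format documents that the month comes first
invoice_date = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

year = invoice_date.dt.year                     # number, e.g. 2010
month_label = invoice_date.dt.strftime("%Y%m")  # text, e.g. "201012"
month_period = invoice_date.dt.to_period("M")   # period, e.g. 2010-12

print(year.tolist())
print(month_label.tolist())
```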


### Filter our data to sales-only data

Say we only want to see the actual sales data, without all the refund records.<br>
Let's filter the data to only rows with:

* `Quantity` higher than 0
* `UnitPrice` higher than 0
* `CustomerID` not missing (we filled the missing IDs with 0 earlier, so this means not equal to 0)
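
The three rules above can be sketched on a hypothetical `toy` table like this. The `0` customer stands in for a missing ID, mirroring the fill we did when fixing the data types.

```python
import pandas as pd

# Hypothetical rows: only the first and the last one satisfy all three rules
toy = pd.DataFrame({
    "Quantity": [6, -5, 12, 4, 12],
    "UnitPrice": [2.55, 2.95, 0.0, 0.85, 0.85],
    "CustomerID": [17850, 12565, 13256, 0, 12680],
})

is_sale = (
    (toy["Quantity"] > 0)        # no refunds
    & (toy["UnitPrice"] > 0)     # no free or adjustment lines
    & (toy["CustomerID"] != 0)   # customer is known
)
toy_sales = toy[is_sale].reset_index(drop=True)

print(toy_sales)
```

Note that the price is compared as a decimal number, so an item costing 0.85 still counts as a sale.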

In [None]:
# let's write our filter conditions


<br><br>
Let's save this new filtered DataFrame into a new DataFrame so we can use it later for analysis 🤔

In [67]:
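# ... and save this filtered data to a new variable called `sales`
# Compare UnitPrice directly: converting it to int would turn prices below 1 into 0 and drop them
# Missing CustomerIDs were filled with 0 earlier, so we exclude 0 instead of checking for nulls
sales = df[(df.Quantity > 0) & (df.UnitPrice > 0) & (df.CustomerID != 0)]
sales.head()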

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Amount,Volume,Year,Month
0,536365,85123A,White Hanging Heart T-Light Holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,Low,2010,201012
1,536365,71053,White Metal Lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Low,2010,201012
2,536365,84406B,Cream Cupid Hearts Coat Hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,Low,2010,201012
3,536365,84029G,Knitted Union Flag Hot Water Bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Low,2010,201012
4,536365,84029E,Red Woolly Hottie White Heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Low,2010,201012


In [None]:
# let's quickly check our sales data that everything is in order


In [68]:
# The index numbers have gaps after filtering, but we can reset them with #reset_index()
# drop=True discards the old index instead of adding it as a new column
sales = sales.reset_index(drop=True)

⭐️ Read more about filtering your data:

https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9



<br><br>
***

## Analyze & Visualize Data

### Pivot Table like in Excel 💪

![image.png](attachment:image.png)

- With `pd.pivot_table()`
- `pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')`
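
Below is a minimal sketch of such a pivot table on a hypothetical `toy_sales` table. It uses `"nunique"` to count distinct customers; that is one interpretation of "count of customers" (use `"count"` instead to count rows).

```python
import pandas as pd

# Hypothetical sales rows, only for illustration
toy_sales = pd.DataFrame({
    "Country": ["France", "France", "Spain", "Spain", "Spain"],
    "CustomerID": [12583, 12583, 12540, 12541, 12540],
    "Amount": [10.0, 20.0, 5.0, 7.5, 2.5],
})

toy_pvt = pd.pivot_table(
    toy_sales,
    index="Country",                   # one row per country
    values=["Amount", "CustomerID"],   # columns to summarize
    aggfunc={"Amount": "sum", "CustomerID": "nunique"},  # one function per column
)

print(toy_pvt)
```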
                    


In [None]:
# Let's do a pivot table that summarizes the data by the 'Country' column
# and calculate the SUM of amount and COUNT of Customers
# let's store this into a new variable called `sales_pvt`


### Data Visualization with Matplotlib

#### Basic Charts & Graphs with matplotlib library

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline 
# to show the plot after the code

# Below is some `runtime configuration` to set certain styling by default
# plt.rc('axes', titlesize=18)     # fontsize of the axes title
# plt.rc('axes', labelsize=14)    # fontsize of the x and y labels
# plt.rc('xtick', labelsize=13)    # fontsize of the tick labels
# plt.rc('ytick', labelsize=13)    # fontsize of the tick labels
# plt.rc('legend', fontsize=13)    # legend fontsize
# plt.rc('font', size=13)          # controls default text sizes

Let's create a simple line chart that shows the Sales $$ trend by month 📈

In [None]:
# Let's start with a pivot table that summarizes the data by month


In [None]:
# Let's create a simple line chart that shows the Sales $$ trend by month
# plt.figure(figsize=(8,4), tight_layout=True)
# syntax -> plt.plot(xAxis, yAxis, options)


# Let's make the axis labels more readable
# And also let's add the chart title, and the X and Y axis labels

# Let's also add customer growth by month into the plot
# Let's add the chart legend
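
One way these steps could look, sketched on a hypothetical `monthly` table (made-up numbers, not results from the workshop data). On the real data, `monthly` would come from a pivot table of `sales` by `Month`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly summary, only for illustration
monthly = pd.DataFrame({
    "Month": ["201012", "201101", "201102", "201103"],
    "Amount": [100.0, 80.0, 95.0, 120.0],
    "Customers": [40, 35, 38, 50],
})

fig, ax = plt.subplots(figsize=(8, 4), tight_layout=True)

# syntax -> ax.plot(xAxis, yAxis, options)
ax.plot(monthly["Month"], monthly["Amount"], marker="o", label="Sales amount")
ax.plot(monthly["Month"], monthly["Customers"], marker="s", label="Customers")

ax.set_title("Sales and customers by month")  # chart title
ax.set_xlabel("Month")                        # X axis label
ax.set_ylabel("Value")                        # Y axis label
ax.tick_params(axis="x", rotation=45)         # rotate labels so they stay readable
ax.legend()                                   # show the legend
```

In a notebook with `%matplotlib inline`, the chart appears right below the cell.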



Let's create a simple bar chart that shows the top 10 countries by revenue 📊

In [None]:
# First create a new data frame and store it in a variable called `top_10`


In [None]:
# Now we can plot the data
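
A possible sketch of both steps on a hypothetical `toy_sales` table. It keeps the top 3 because the toy data has only four countries; `top_n = 10` would be the setting for the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales rows, only for illustration
toy_sales = pd.DataFrame({
    "Country": ["France", "Spain", "France", "Italy", "Germany", "Spain"],
    "Amount": [30.0, 5.0, 20.0, 8.0, 40.0, 10.0],
})

top_n = 3  # use 10 on the full data

# Total amount per country, then keep the largest totals (sorted high to low)
top = toy_sales.groupby("Country")["Amount"].sum().nlargest(top_n)

fig, ax = plt.subplots(figsize=(8, 4), tight_layout=True)
ax.bar(top.index, top.values)
ax.set_title(f"Top {top_n} countries by revenue")
ax.set_xlabel("Country")
ax.set_ylabel("Revenue")
```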


<br><br>
***

## Awesome job everyone! We will stop here today 😇

Thank you for joining this workshop ❤️

There is still so much to learn, but we hope it's a good start 💪

xxx  👩🏻‍💻

![image.png](attachment:image.png)