### Lesson 3: Exploring and querying columns

### Part 2.3.1: Navigating data insights

In [6]:
import pandas as pd

In [7]:
# Loading the dataset which does not contain any missing values
pos_data = pd.read_csv('POS_CleanData.csv')
pos_data.head()

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0


In [8]:
# let's quickly check its shape
pos_data.shape

(31057, 10)

#### What are some basic statistical features of the data?


In [9]:
# use describe() method
pos_data.describe()

Unnamed: 0,Revenue($),Units_sold,Page_traffic
count,31057.0,31057.0,31057.0
mean,14377.151657,701.464469,2051.972051
std,13424.798113,647.234465,1978.47992
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,14926.0,764.0,1960.0
75%,25655.0,1222.0,3646.0
max,48572.0,3386.0,10696.0


**Analysis:**
- The `describe()` method reports summary statistics such as count, mean, standard deviation, min, max and percentiles for numerical columns.
- The average revenue is about USD 14377 and the average number of units sold is about 701.
- The 25th percentile of revenue is 0, which equals the minimum, so at least 25% of the revenue values are exactly 0 (none are below 0).
- The median revenue, which is the same as the 50th percentile, is USD 14926.
- More details about statistical concepts will be discussed in Module 3 of this course.
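
To make the percentile reading concrete, here is a minimal sketch on a small made-up frame. The values are illustrative assumptions, not rows from `POS_CleanData.csv`; only the column names are borrowed from the lesson. It looks up individual statistics from the `describe()` result and compares the 25th percentile with the share of zero-revenue rows.

```python
import pandas as pd

# Toy data (hypothetical values), reusing column names from the lesson
toy = pd.DataFrame({
    'Revenue($)': [0, 0, 0, 12000, 15000, 18000, 25000, 30000],
    'Units_sold': [0, 0, 0, 600, 750, 900, 1200, 1500],
})

# describe() returns a DataFrame, so single statistics can be read by label
summary = toy.describe()
q1_revenue = summary.loc['25%', 'Revenue($)']
median_revenue = summary.loc['50%', 'Revenue($)']

# The mean of a boolean Series is the fraction of True values,
# here the share of rows with zero revenue
zero_share = (toy['Revenue($)'] == 0).mean()

print(q1_revenue, median_revenue, zero_share)
```

When the 25th percentile equals the minimum, as it does here, at least a quarter of the rows sit at that minimum value.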

#### The summary of categorical features can also be displayed using the `describe()` method

In [10]:
# use describe() with include='all' to summarise categorical columns as well
pos_data.describe(include='all')

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
count,31057,31057,31057,31057,31057,31057,31057,31057.0,31057.0,31057.0
unique,380,105,1,3,7,18,23,,,
top,SKU1237,4/30/2022,Synergix solutions,Fabric Care,Laundry Detergents,Liquid,Gain,,,
freq,320,297,31057,15456,10151,7208,5587,,,
mean,,,,,,,,14377.151657,701.464469,2051.972051
std,,,,,,,,13424.798113,647.234465,1978.47992
min,,,,,,,,0.0,0.0,0.0
25%,,,,,,,,0.0,0.0,0.0
50%,,,,,,,,14926.0,764.0,1960.0
75%,,,,,,,,25655.0,1222.0,3646.0
max,,,,,,,,48572.0,3386.0,10696.0


- `describe(include='all')` lists the summary features of both categorical and numerical attributes.
- Wherever a certain statistical feature is not applicable, *NaN* (Not a Number) is displayed.
- We see that there are three unique sectors, seven unique categories, 18 unique segments and 23 unique brands of products.
- Quantitative features like mean, standard deviation and percentiles are not applicable to categorical data, so *NaN* is displayed for them. Likewise, `unique`, `top` and `freq` are *NaN* for the numerical columns.
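
The NaN pattern can be reproduced on a tiny frame. This is a minimal sketch with made-up values (the `toy` frame is a hypothetical stand-in for `pos_data`), showing which rows of the summary are filled for a text column and which for a numeric one.

```python
import pandas as pd

# Toy data (hypothetical values): one text column, one numeric column
toy = pd.DataFrame({
    'Brand': ['Gain', 'Gain', 'Close-up', 'Colgate'],
    'Units_sold': [10, 20, 30, 40],
})

full = toy.describe(include='all')

# Categorical summary rows are filled only for the text column
brand_unique = full.loc['unique', 'Brand']   # number of distinct brands
brand_top = full.loc['top', 'Brand']         # most frequent brand
brand_freq = full.loc['freq', 'Brand']       # how often the top brand occurs

# Statistics that do not apply to a column are reported as NaN
mean_is_nan = pd.isna(full.loc['mean', 'Brand'])
unique_is_nan = pd.isna(full.loc['unique', 'Units_sold'])

print(full)
```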


### Part 2.3.2: Querying on columns
- We have previously learned how to extract only the required rows and columns using slicing and indexing.
- Sometimes we may want to extract a subset of the data that satisfies certain criteria.
- To achieve this, we will use relational and logical operators in Python.
- Relational operators like `<` (less than), `>` (greater than), `<=` (less than or equal to), `>=` (greater than or equal to), `!=` (not equal to) and `==` (equal to) are useful.
- For element-wise logic on pandas columns, the logical operators are `&` (and), `|` (or) and `~` (not).
- Applied to a column, both relational and logical operators produce a Series of boolean values (True or False), one per row.
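
As a minimal sketch of these operators, the example below applies them to two small made-up Series (the names `revenue` and `traffic` and their values are assumptions for illustration). Negation uses `~`, which pandas applies element-wise.

```python
import pandas as pd

# Toy Series (hypothetical values)
revenue = pd.Series([0, 12000, 26000, 9000])
traffic = pd.Series([0, 500, 800, 4000])

# Relational operators compare every element and return a boolean Series
high_revenue = revenue >= 10000
low_traffic = traffic < 1000

# Logical operators combine boolean Series element by element
both = high_revenue & low_traffic      # True where both conditions hold
either = high_revenue | low_traffic    # True where at least one holds
not_high = ~high_revenue               # flips True and False

print(both.tolist(), either.tolist(), not_high.tolist())
```

When the comparisons are written inline, each one needs its own parentheses, as in `(revenue >= 10000) & (traffic < 1000)`, because `&` and `|` bind more tightly than the comparison operators. Python's `and`/`or` keywords do not work element-wise on a Series and raise a `ValueError`.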

In [11]:
# let us quickly look at the data once again
pos_data.head()

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0


#### How many records belong to the Close-up brand?

In [12]:
# extract all the records where Brand is 'Close-up'
df = pos_data.loc[(pos_data['Brand']=='Close-up')]
df.head()   # display the first 5 rows; observe the row indices and the Brand column

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0
6,SKU1021,04-09-22,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,25243,1513,5639
14,SKU1029,4/16/2022,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,23204,1647,4922
23,SKU1028,4/30/2022,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,24651,1483,4601
35,SKU1029,8/21/2021,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0


**Explanation:**
- When we use the relational operator in `pos_data['Brand']=='Close-up'`, the condition is evaluated against every row of the data frame.
- It returns a boolean Series holding either *True* or *False* for each row index of the data frame.
- When we pass this Series to `pos_data.loc[]`, only the rows with a True value are returned, and they keep their original row indices.
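
The two steps can be separated to make the intermediate boolean Series visible. The following is a minimal sketch on a made-up frame (`toy` and its values are hypothetical); it also shows that summing the mask gives the match count directly, since True counts as 1.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'SKU ID': ['SKU1', 'SKU2', 'SKU3', 'SKU4'],
    'Brand': ['Close-up', 'Colgate', 'Close-up', 'Gain'],
})

# Step 1: the comparison yields one True/False per row, aligned with toy's index
mask = toy['Brand'] == 'Close-up'

# Step 2: .loc keeps only the rows where the mask is True,
# preserving their original index labels
subset = toy.loc[mask]

# Counting matches without building the subset
n_matches = int(mask.sum())

print(mask.tolist(), subset.index.tolist(), n_matches)
```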

In [13]:
# find out how many rows there are in df
df.shape[0]

342

#### How many records have revenue between USD 10K and USD 15K (inclusive)?

In [14]:
# extract all the records where revenue is between 10K and 15K (both ends included)
df = pos_data.loc[(pos_data['Revenue($)']>=10000) & (pos_data['Revenue($)']<=15000)]
df.head()   

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
16,SKU1058,01-02-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,11314,834,5709
27,SKU1045,01-02-21,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Sensodyne,12873,620,94
28,SKU1038,3/13/2021,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Colgate,10152,564,535
57,SKU1063,12-11-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,14617,428,769
58,SKU1036,02-12-22,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Colgate,11314,332,720


In [15]:
# find out how many rows there are in df
df.shape[0]

2238
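
A range check like the one above can also be written with `Series.between`, which is inclusive on both ends by default. This is a minimal sketch on made-up values (the `toy` frame is hypothetical), comparing the two spellings at the boundaries.

```python
import pandas as pd

# Toy data (hypothetical values) chosen to sit on and around the boundaries
toy = pd.DataFrame({'Revenue($)': [9999, 10000, 12500, 15000, 15001]})

# Two comparisons joined with &, as in the lesson
explicit = toy.loc[(toy['Revenue($)'] >= 10000) & (toy['Revenue($)'] <= 15000)]

# Equivalent range check; between() includes both endpoints by default
compact = toy.loc[toy['Revenue($)'].between(10000, 15000)]

print(compact['Revenue($)'].tolist())
```

Both forms select the same rows; `between` is simply shorter when a single column is tested against a lower and an upper bound.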

#### How many records have revenue of at least USD 25K and page traffic less than 1000?

In [17]:
# we need two relational operators, linked with a logical operator
df = pos_data.loc[(pos_data['Revenue($)']>=25000) & (pos_data['Page_traffic']<1000) ]
df.head()   

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
152,SKU1049,3/27/2021,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Sensodyne,25850,1034,448
876,SKU1042,10-01-22,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Sensodyne,32082,697,421
1343,SKU1050,3/13/2021,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Sensodyne,25051,1078,966
1410,SKU1050,09-03-22,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Sensodyne,29921,805,494
1527,SKU1022,12-10-22,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,26270,1255,710


In [18]:
# Find out how many rows are there in the df
df.shape[0]

167

***Insight:***
- 167 records have revenue of at least USD 25K and page traffic less than 1000. Each record is one row, and the same SKU can appear on several dates.
- The page traffic indicates the number of people who visited a particular product's page on the e-retailer website.
- So, we can infer that these 167 records show high revenue even though the number of visitors is low.
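
Because the same `SKU ID` can appear on several dates (for example, `SKU1029` shows up in rows 0, 14 and 35 of the Close-up output above), the row count of a filtered frame is not necessarily the number of distinct products. A minimal sketch on made-up values (the `toy` frame is hypothetical) shows one way to count both:

```python
import pandas as pd

# Toy data (hypothetical values): SKU1 and SKU3 each appear twice
toy = pd.DataFrame({
    'SKU ID': ['SKU1', 'SKU1', 'SKU2', 'SKU3', 'SKU3'],
    'Revenue($)': [26000, 27000, 30000, 5000, 28000],
    'Page_traffic': [400, 900, 2000, 100, 950],
})

# Same two-condition filter as in the lesson
subset = toy.loc[(toy['Revenue($)'] >= 25000) & (toy['Page_traffic'] < 1000)]

n_records = subset.shape[0]               # number of matching rows
n_products = subset['SKU ID'].nunique()   # number of distinct SKUs among them

print(n_records, n_products)
```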