## Module 2: Data wrangling with Python

### Lesson 1: Getting started with a data set

### Part 2.1.1  : Diving into CSV data

##### Introduction to Pandas
- Pandas is an open-source Python library providing high-performance data manipulation and analysis tools.
- The name Pandas is derived from the term "panel data", an econometrics term for multidimensional data.
- Using Pandas, we can accomplish the five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.

##### Key features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.

##### Data structures in Pandas
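- Series: one-dimensional
- DataFrame: two-dimensional
- Panel: three-dimensional; deprecated and later removed from pandas, so current versions represent higher-dimensional data with a `MultiIndex` on a DataFrame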
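
To make the two main structures concrete, here is a minimal sketch. The variable names `units` and `sales` are made up for illustration, and the values are borrowed from a few rows of the POS data shown later in this lesson.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
units = pd.Series(
    [864, 1513, 1509],
    index=["SKU1019", "SKU1021", "SKU1044"],
    name="Units_sold",
)

# A DataFrame is a two-dimensional table; each column is a Series
# and all columns share the same row index.
sales = pd.DataFrame(
    {
        "Brand": ["Colgate", "Close-up", "Sensodyne"],
        "Units_sold": [864, 1513, 1509],
    },
    index=["SKU1019", "SKU1021", "SKU1044"],
)

print(units.ndim, sales.ndim)      # number of dimensions of each object
print(type(sales["Units_sold"]))   # selecting one column gives back a Series
```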

In [1]:
# Import the required packages
import pandas as pd

### Pandas provides several built-in functions for reading different types of data, such as Excel, CSV, JSON, and HTML.

In [2]:
#read the data file
pos_data = pd.read_csv('POS_Data.csv')
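
If `POS_Data.csv` is not at hand, the reader functions can be tried on small in-memory strings. The sketch below is only an illustration: `csv_text` and `json_text` are invented samples, not the course data set.

```python
import io
import pandas as pd

# A tiny CSV held in a string; io.StringIO makes it look like a file.
csv_text = """SKU ID,Brand,Revenue($),Units_sold
SKU1019,Colgate,8239,864
SKU1021,Close-up,25243,1513
"""
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.dtypes)   # pandas infers a type for each column

# The JSON reader follows the same pattern.
json_text = '[{"SKU ID": "SKU1019", "Units_sold": 864}]'
from_json = pd.read_json(io.StringIO(json_text))
print(from_json)
```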

In [3]:
# display top few records
pos_data.head()

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Unit_price,Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,,0,0.0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,,0,0.0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,,0,0.0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,,0,0.0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,,0,0.0


In [4]:
# display top 8 records
pos_data.head(8)

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Unit_price,Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,,0,0.0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,,0,0.0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,,0,0.0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,,0,0.0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,,0,0.0
5,SKU1019,3/20/2021,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Colgate,8239,9.53588,864,3543.0
6,SKU1021,04-09-22,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,25243,16.684071,1513,5639.0
7,SKU1044,4/23/2022,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Sensodyne,24707,16.373095,1509,5161.0


In [5]:
# display last 5 records
pos_data.tail()   

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Unit_price,Units_sold,Page_traffic
31180,SKU1307,6/25/2022,Synergix solutions,Beauty and Personal Care,Haircare,Shampoo,Pantene,7728,6.893845,1121,1447.0
31181,SKU1299,09-10-22,Synergix solutions,Beauty and Personal Care,,Shampoo,Dove,24509,16.504377,1485,2663.0
31182,SKU1296,11/26/2022,Synergix solutions,Beauty and Personal Care,,Shampoo,Dove,18704,16.321117,1146,1146.0
31183,SKU1311,12/17/2022,Synergix solutions,Beauty and Personal Care,,Conditioners,Pantene,19292,13.919192,1386,1554.0
31184,SKU1319,12/24/2022,Synergix solutions,Beauty and Personal Care,Haircare,Conditioners,Pantene,22950,17.887763,1283,1117.0


In [6]:
# display last 3 records
pos_data.tail(3)   

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Unit_price,Units_sold,Page_traffic
31182,SKU1296,11/26/2022,Synergix solutions,Beauty and Personal Care,,Shampoo,Dove,18704,16.321117,1146,1146.0
31183,SKU1311,12/17/2022,Synergix solutions,Beauty and Personal Care,,Conditioners,Pantene,19292,13.919192,1386,1554.0
31184,SKU1319,12/24/2022,Synergix solutions,Beauty and Personal Care,Haircare,Conditioners,Pantene,22950,17.887763,1283,1117.0
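
As a quick recap of `head()` and `tail()`, the sketch below uses a throwaway ten-row frame (`df` is hypothetical). It also shows that a negative argument keeps everything except that many rows from the other end.

```python
import pandas as pd

df = pd.DataFrame({"row": range(10)})

first_three = df.head(3)        # rows 0, 1, 2
last_two = df.tail(2)           # rows 8, 9
all_but_last_two = df.head(-2)  # rows 0 to 7

print(len(first_three), len(last_two), len(all_but_last_two))
```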


###### How many rows and columns are there in the dataset? 

In [7]:
# use shape attribute of pandas dataframe to check structure of the data
pos_data.shape

(31185, 11)

###### What are the various attributes of the data?

In [8]:
#list the columns
pos_data.columns

Index(['SKU ID', 'Date', 'Manufacturer', 'Sector', 'Category', 'Segment',
       'Brand', 'Revenue($)', 'Unit_price', 'Units_sold', 'Page_traffic'],
      dtype='object')
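
`shape` is a plain tuple and `columns` is an Index object, so both are easy to use in further code. A small sketch on an invented two-column frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"SKU ID": ["SKU1019", "SKU1021", "SKU1044"], "Units_sold": [864, 1513, 1509]}
)

# shape is (number of rows, number of columns) and can be unpacked directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() of a dataframe is its row count; list() turns the column Index into a list.
print(len(df) == n_rows)
column_names = list(df.columns)
print(column_names)
```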

### Part 2.1.2  : Data inspection

In [9]:
# what is the type of each attribute?
pos_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31185 entries, 0 to 31184
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SKU ID        31185 non-null  object 
 1   Date          31185 non-null  object 
 2   Manufacturer  31185 non-null  object 
 3   Sector        31135 non-null  object 
 4   Category      31149 non-null  object 
 5   Segment       31156 non-null  object 
 6   Brand         31158 non-null  object 
 7   Revenue($)    31185 non-null  int64  
 8   Unit_price    19635 non-null  float64
 9   Units_sold    31185 non-null  int64  
 10  Page_traffic  31185 non-null  float64
dtypes: float64(2), int64(2), object(7)
memory usage: 2.6+ MB


##### The Page_traffic column records the number of visitors to the product page, so storing it as an integer makes more sense.

In [10]:
# convert the column to an integer type (safe here because Page_traffic has no missing values)
pos_data['Page_traffic'] = pos_data['Page_traffic'].astype('int64')

In [11]:
#check the data type of Page_traffic again
pos_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31185 entries, 0 to 31184
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SKU ID        31185 non-null  object 
 1   Date          31185 non-null  object 
 2   Manufacturer  31185 non-null  object 
 3   Sector        31135 non-null  object 
 4   Category      31149 non-null  object 
 5   Segment       31156 non-null  object 
 6   Brand         31158 non-null  object 
 7   Revenue($)    31185 non-null  int64  
 8   Unit_price    19635 non-null  float64
 9   Units_sold    31185 non-null  int64  
 10  Page_traffic  31185 non-null  int64  
dtypes: float64(1), int64(3), object(7)
memory usage: 2.6+ MB
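
The conversion above works because `Page_traffic` has no missing values. The sketch below, on made-up Series, shows what would likely happen otherwise: a float column containing `NaN` cannot be cast to `int64`, while pandas' nullable `Int64` type (capital "I") is one possible alternative.

```python
import numpy as np
import pandas as pd

# No missing values: the cast to int64 succeeds.
clean = pd.Series([3543.0, 5639.0, 5161.0], name="Page_traffic")
as_int = clean.astype("int64")

# With a missing value, the same cast raises an error (a ValueError subclass).
with_gap = pd.Series([3543.0, np.nan, 5161.0], name="Page_traffic")
try:
    with_gap.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer dtype keeps the gap as <NA> instead of failing.
nullable = with_gap.astype("Int64")

print(as_int.dtype, cast_failed, nullable.dtype)
```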


### Part 2.1.3  : Identifying missing data

In [12]:
# find out the missing values in each attribute
pos_data.isna().sum()

SKU ID              0
Date                0
Manufacturer        0
Sector             50
Category           36
Segment            29
Brand              27
Revenue($)          0
Unit_price      11550
Units_sold          0
Page_traffic        0
dtype: int64

In [13]:
pos_data.isna()

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Unit_price,Units_sold,Page_traffic
0,False,False,False,False,False,False,False,False,True,False,False
1,False,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...
31180,False,False,False,False,False,False,False,False,False,False,False
31181,False,False,False,False,True,False,False,False,False,False,False
31182,False,False,False,False,True,False,False,False,False,False,False
31183,False,False,False,False,True,False,False,False,False,False,False


**NOTE:**
- Let us understand what `isna()` does.
- It checks every value in the dataframe and returns a boolean for each one: True if the value is missing, False otherwise.
- In arithmetic, True is treated as 1 and False is treated as 0.
- When we call `sum()` afterwards, these values are added up column by column, which gives the number of missing values in each column.
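
The same mechanics on a three-row toy frame (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Sector": ["Oral Care", None, "Oral Care"],
        "Unit_price": [np.nan, np.nan, 9.53588],
    }
)

mask = df.isna()                 # dataframe of True/False, same shape as df
per_column = mask.sum()          # missing values in each column
per_row = mask.sum(axis=1)       # missing values in each row
total = int(mask.sum().sum())    # missing values in the whole frame

print(per_column)
print(per_row)
print(total)
```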


In [14]:
# isnull() is an alias of isna(), so it gives the same result
pos_data.isnull().sum()

SKU ID              0
Date                0
Manufacturer        0
Sector             50
Category           36
Segment            29
Brand              27
Revenue($)          0
Unit_price      11550
Units_sold          0
Page_traffic        0
dtype: int64

**Analysis:** 
- Many values are missing in the attribute *Unit_price*.
- A closer look at the data shows that *Revenue* is zero on a few dates, and *Units_sold* is zero on the same rows.
- Comparing the columns row by row also shows that

$ UnitPrice = \frac{Revenue}{UnitsSold} $
- So, wherever *Revenue* and *Units_sold* are zero, *Unit_price* is empty, and pandas in turn treats the empty cell as a missing value (`NaN`).
- During data analysis, we can remove a column from the dataset if it is derived from other attributes. This is one of the techniques of feature engineering, which we will explore in more detail at a later stage.
- So, we will now remove the attribute *Unit_price*.
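
The claim that *Unit_price* is derived can be checked numerically. The sketch below copies four rows from the outputs above into a small frame called `rows` (a name chosen here for illustration) and compares the stored price with revenue divided by units sold, allowing for rounding.

```python
import numpy as np
import pandas as pd

# One zero-sales row plus three rows with sales, taken from head(8) above.
rows = pd.DataFrame(
    {
        "Revenue($)": [0, 8239, 25243, 24707],
        "Units_sold": [0, 864, 1513, 1509],
        "Unit_price": [np.nan, 9.53588, 16.684071, 16.373095],
    }
)

# 0 / 0 is undefined, so pandas yields NaN for the zero-sales row.
recomputed = rows["Revenue($)"] / rows["Units_sold"]

# Compare within floating-point tolerance; equal_nan treats NaN == NaN as a match.
matches = np.isclose(recomputed, rows["Unit_price"], equal_nan=True)
print(matches.all())
```

On the full data set the same comparison could be run on `pos_data` before the column is dropped.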

In [15]:
# the drop() method is used to drop a column
# a pandas dataframe is two-dimensional: rows are axis 0 and columns are axis 1

pos_data = pos_data.drop(['Unit_price'], axis=1)
pos_data.head()   #confirm whether the Unit_price column has been removed.

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0


In [16]:
pos_data.shape

(31185, 10)
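
`drop()` also accepts a `columns=` keyword, which some readers find clearer than `axis=1`. A minimal sketch on a one-row frame (invented for illustration); note that `drop()` returns a new dataframe and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame(
    {"Revenue($)": [8239], "Unit_price": [9.53588], "Units_sold": [864]}
)

by_axis = df.drop(["Unit_price"], axis=1)      # style used in this lesson
by_keyword = df.drop(columns=["Unit_price"])   # equivalent keyword form

# errors="ignore" skips labels that do not exist instead of raising KeyError.
tolerant = df.drop(columns=["Not_a_column"], errors="ignore")

print(by_axis.equals(by_keyword))
print(list(df.columns))   # the original still has Unit_price
```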

In [17]:
# Now check for the missing values again
pos_data.isnull().sum()

SKU ID           0
Date             0
Manufacturer     0
Sector          50
Category        36
Segment         29
Brand           27
Revenue($)       0
Units_sold       0
Page_traffic     0
dtype: int64

##### We can check the proportion of these missing values compared to the original data size

In [18]:
pos_data.isna().sum()/31185

SKU ID          0.000000
Date            0.000000
Manufacturer    0.000000
Sector          0.001603
Category        0.001154
Segment         0.000930
Brand           0.000866
Revenue($)      0.000000
Units_sold      0.000000
Page_traffic    0.000000
dtype: float64

In [19]:
# divide the number of missing values by the total number of records (taken from shape, not hard-coded)
pos_data.isna().sum()/pos_data.shape[0]

SKU ID          0.000000
Date            0.000000
Manufacturer    0.000000
Sector          0.001603
Category        0.001154
Segment         0.000930
Brand           0.000866
Revenue($)      0.000000
Units_sold      0.000000
Page_traffic    0.000000
dtype: float64

In [20]:
pos_data.isna().sum()*100/pos_data.shape[0]

SKU ID          0.000000
Date            0.000000
Manufacturer    0.000000
Sector          0.160333
Category        0.115440
Segment         0.092993
Brand           0.086580
Revenue($)      0.000000
Units_sold      0.000000
Page_traffic    0.000000
dtype: float64
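
Because True counts as 1 and False as 0, the mean of the `isna()` mask is already the fraction of missing values, so the division can be skipped. A sketch on a four-row toy frame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Sector": ["Oral Care", None, "Oral Care", "Oral Care"],
        "Units_sold": [0, 864, 1513, 1509],
    }
)

by_division = df.isna().sum() * 100 / df.shape[0]   # approach used above
by_mean = df.isna().mean() * 100                    # shorter equivalent

print(by_mean)
```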

**Analysis:**
- The missing values make up a very small proportion of the data (less than 0.2% in every column).
- When we use the data to build machine learning models, we usually impute the missing values using a suitable method.
- However, for basic data analysis, let us drop the rows containing missing values for now.
- We will cover imputation during EDA in Module 4.

In [21]:
# use the dropna() method to drop the rows containing missing values

pos_data.dropna(axis=0, inplace=True) 

In [22]:
# the data after removing the missing values
pos_data.shape  

(31057, 10)
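
Note that 31185 - 31057 = 128 rows were removed, fewer than the 142 missing values counted earlier (50 + 36 + 29 + 27), because a row with several missing values is dropped only once. The sketch below shows the default behaviour and the `subset=` option on an invented three-row frame.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Sector": ["Oral Care", None, "Oral Care"],
        "Brand": ["Colgate", "Dove", None],
        "Units_sold": [864, 1485, 1509],
    }
)

# Default: drop a row if ANY of its values is missing.
any_missing = df.dropna(axis=0)

# subset= restricts the check to the listed columns.
only_sector = df.dropna(subset=["Sector"])

print(len(df), len(any_missing), len(only_sector))
```

Without `inplace=True`, `dropna()` returns a new dataframe and leaves `df` unchanged.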

**NOTE:**
- For the remaining part of this module and for Module 3, we will be using this modified data.
- So, let us save it on the disk for further usage. 

In [23]:
# use the to_csv() method to write a dataframe to a CSV file on disk

pos_data.to_csv('POS_CleanData.csv', index=False)

###### The argument `index=False` ensures that the row indices are not written to the CSV file on disk.
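
To see what `index=False` prevents, the sketch below writes a toy frame to an in-memory string both ways and reads it back. When `to_csv()` is called without a path it returns the CSV text, which avoids touching the disk here.

```python
import io
import pandas as pd

df = pd.DataFrame({"SKU ID": ["SKU1019", "SKU1021"], "Units_sold": [864, 1513]})

# Default: the row index is written as an extra, unnamed first column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the real columns are written.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

The extra column comes back under the name `Unnamed: 0`, which is why `index=False` is usually preferred when the index carries no information.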