# Data Cleaning Demo Project

# Steps

- Import the data and assign appropriate datatypes based on the sample data
- Detect and treat missing values
- Look for rows that need to be cleaned and trimmed
- Prepare a data cleaning report and share it with the client
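
As an illustration of the first step, datatypes can be assigned while the file is read rather than afterwards. The sketch below is a minimal example under assumptions: the in-memory sample reuses a few rows from the price and stock data, the pipe-delimited layout without an index column is assumed, and the `column_types` mapping is a hypothetical choice, not the mapping agreed with the client.

```python
import io

import pandas as pd

# Assumed layout: pipe-delimited text with a header row. The values are a few
# rows from the price/stock data; the real file may carry extra columns.
sample_csv = io.StringIO(
    "SKU|BRANCH|PRICE|STOCK\n"
    "142308|HRO|45.6|3\n"
    "223240|RHSM|15.3|5\n"
    "49514|RHSM|56.7|-2\n"
)

# Hypothetical dtype mapping: BRANCH has few distinct values, so 'category'
# is a reasonable guess; the numeric columns keep explicit widths.
column_types = {
    "SKU": "int64",
    "BRANCH": "category",
    "PRICE": "float64",
    "STOCK": "int64",
}

prices = pd.read_csv(sample_csv, sep="|", dtype=column_types)
print(prices.dtypes)
```

Passing `dtype` to `pd.read_csv` makes a type mismatch fail at load time instead of surfacing later in the analysis.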

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None

In [2]:
myfile = 'assets/PRICES-STOCK.csv'
BranchProductdata = pd.read_csv(myfile,sep='|')

In [3]:
BranchProductdata.head()

Unnamed: 0,SKU,BRANCH,PRICE,STOCK
0,142308,HRO,45.6,3
1,223240,RHSM,15.3,5
2,49514,RHSM,56.7,-2
3,111807,RHSM,48.0,-4
4,223543,MS,8.7,-1


2296767

### Check null cases

In [4]:
BranchProductdata.isnull().sum()/len(BranchProductdata)*100

SKU        0.000000
BRANCH    19.960884
PRICE      0.000000
STOCK      0.000000
dtype: float64

## Key Findings

- Branch information is not available for 19.96% of the records. After analysis, we decided to remove those records because they did not add any significant insight to the data. Sample rows have been attached to the email.
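
The removal described in this finding can be sketched with `dropna`. The frame below is a hypothetical sample: the values come from the first rows of the data, but one `BRANCH` value has been blanked on purpose to illustrate the step, so the 20% figure here is an artifact of the sample and not the project figure.

```python
import pandas as pd

# Hypothetical sample; the missing BRANCH in the third row is introduced
# for illustration only.
sample = pd.DataFrame(
    {
        "SKU": [142308, 223240, 49514, 111807, 223543],
        "BRANCH": ["HRO", "RHSM", None, "RHSM", "MS"],
        "PRICE": [45.6, 15.3, 56.7, 48.0, 8.7],
        "STOCK": [3, 5, -2, -4, -1],
    }
)

# Share of rows without a branch, in percent
missing_pct = sample["BRANCH"].isna().mean() * 100

# Drop only the rows where BRANCH is missing; other columns are not checked
cleaned = sample.dropna(subset=["BRANCH"])
removed = len(sample) - len(cleaned)

print(f"{missing_pct:.2f}% missing, {removed} row(s) removed")
```

Comparing the row counts before and after gives a quick check that only the records with a missing branch were dropped.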


## Analyse the null cases

In [5]:
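# Despite the variable name, this keeps only the rows where BRANCH is present,
# i.e. it drops the ~19.96% of records with a missing branch.
BranchProductdataNullCases = BranchProductdata[BranchProductdata['BRANCH'].notnull()]
BranchProductdataNullCases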

Unnamed: 0,SKU,BRANCH,PRICE,STOCK
0,142308,HRO,45.6,3
1,223240,RHSM,15.3,5
2,49514,RHSM,56.7,-2
3,111807,RHSM,48.0,-4
4,223543,MS,8.7,-1
...,...,...,...,...
2296762,243017,HRO,73.2,4
2296763,185713,MS,85.5,2
2296764,296517,RHSM,59.4,3
2296765,52814,MS,23.7,-1


## Filter to keep only the 'MM' and 'RHSM' branches

In [6]:
BranchFilter = BranchProductdataNullCases.loc[BranchProductdataNullCases['BRANCH'].isin(['MM','RHSM'])]
set(BranchFilter['BRANCH'])

{'MM', 'RHSM'}

# Import Product data

In [20]:
Productfile = 'assets/PRODUCTS.csv'
Productdata = pd.read_csv(Productfile,sep='|')

In [8]:
Productdata.head()

Unnamed: 0,SKU,BUY_UNIT,DESCRIPTION_STATUS,ORGANIC_ITEM,KIRLAND_ITEM,FINELINE_NUMBER,BARCODES,NAME,DESCRIPTION,IMAGE_URL,CATEGORY,SUB_CATEGORY,SUB_SUB_CATEGORY,BRAND
0,52572,UN,B,,False,713115,693177767936,CANASTOS,<p>CANASTO CONEJO F1 A 1UN</p>,https://locahost:8000/images/693177767936?heig...,APPAREL,MENS WEAR,MEN - T-SHIRTS,PUPPY DOG PALS
1,278100,,R,,True,345827,399,LIMÓN COLIMA KG,<p>LIMÓN COLIMA KG </p>,https://locahost:8000/images/399?height=500&wi...,ENTERTAINMENT,TOYS & HOBBIES,PELUCHES,ENTRECOT
2,27404,KG,B,,True,134923,762230099043,CHICLES,<p>TRIDENT 6S SANDIA 9GR</p>,https://locahost:8000/images/762230099043?heig...,HARDLINES,AUTOMOTIVE,APARIENCIA AUTOMOVIL,SPURA
3,215143,UN,CD,Y,True,773663,7501199416236,JABON LIQUIDO PARA MANOS VITAMINA E,<p>JABON LIQUIDO PARA MANOS VITAMINA E 442 ML....,https://locahost:8000/images/7501199416236?hei...,HARDLINES,SEASONAL,SEASONAL - EASTER,CHAMYTO
4,85805,KG,CD,N,False,233313,8410113003119,VINO TINTO GRAN CORONAS TORRES 750,<p>VINO TINTO GRAN CORONAS TORRES 750 1 PZA</p>,https://locahost:8000/images/8410113003119?hei...,CONSUMABLES,HOUSEHOLD CHEMICALS,AIR FRESHENER-DEODORIZER,POOPSIE


In [9]:
Productdata.dtypes

SKU                    int64
BUY_UNIT              object
DESCRIPTION_STATUS    object
ORGANIC_ITEM          object
KIRLAND_ITEM            bool
FINELINE_NUMBER        int64
BARCODES               int64
NAME                  object
DESCRIPTION           object
IMAGE_URL             object
CATEGORY              object
SUB_CATEGORY          object
SUB_SUB_CATEGORY      object
BRAND                 object
dtype: object

In [10]:
Productdata['BUY_UNIT'] = Productdata['BUY_UNIT'].astype('category')
Productdata['DESCRIPTION_STATUS'] = Productdata['DESCRIPTION_STATUS'].astype('category')
Productdata['BRAND'] = Productdata['BRAND'].astype('category')
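
One side effect of the `category` cast is worth noting: a categorical column only accepts values that are already among its categories, so filling missing values with a new label such as `'N/A'` typically raises an error in pandas unless the label is registered first. The sketch below shows one way to handle that on a small hypothetical series; it is an illustration, not necessarily how the cells in this notebook were run.

```python
import pandas as pd

# Hypothetical stand-in for a column such as BUY_UNIT after the category cast
buy_unit = pd.Series(["UN", None, "KG", "UN"], dtype="category")

# Register the placeholder as a valid category, then fill the gaps with it
filled = buy_unit.cat.add_categories("N/A").fillna("N/A")

print(filled.tolist())
print(list(filled.cat.categories))
```

An alternative is to fill the missing values while the column is still of type `object` and cast to `category` afterwards.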

In [11]:
Productdata.select_dtypes(include = 'number').head()

Unnamed: 0,SKU,FINELINE_NUMBER,BARCODES
0,52572,713115,693177767936
1,278100,345827,399
2,27404,134923,762230099043
3,215143,773663,7501199416236
4,85805,233313,8410113003119


### Analyze null cases

In [12]:
Productdata.isnull().sum()/len(Productdata)*100

SKU                    0.000000
BUY_UNIT              33.303581
DESCRIPTION_STATUS    20.068296
ORGANIC_ITEM          33.330327
KIRLAND_ITEM           0.000000
FINELINE_NUMBER        0.000000
BARCODES               0.000000
NAME                   0.003732
DESCRIPTION            0.000000
IMAGE_URL              0.000000
CATEGORY               0.459038
SUB_CATEGORY           0.459038
SUB_SUB_CATEGORY       0.459038
BRAND                  0.041052
dtype: float64

In [23]:
Productdata2 = Productdata[Productdata['ORGANIC_ITEM'].isnull()]
Productdata2

Unnamed: 0,SKU,BUY_UNIT,DESCRIPTION_STATUS,ORGANIC_ITEM,KIRLAND_ITEM,FINELINE_NUMBER,BARCODES,NAME,DESCRIPTION,IMAGE_URL,CATEGORY,SUB_CATEGORY,SUB_SUB_CATEGORY,BRAND
0,52572,UN,B,,False,713115,693177767936,CANASTOS,<p>CANASTO CONEJO F1 A 1UN</p>,https://locahost:8000/images/693177767936?heig...,APPAREL,MENS WEAR,MEN - T-SHIRTS,PUPPY DOG PALS
1,278100,,R,,True,345827,399,LIMÓN COLIMA KG,<p>LIMÓN COLIMA KG </p>,https://locahost:8000/images/399?height=500&wi...,ENTERTAINMENT,TOYS & HOBBIES,PELUCHES,ENTRECOT
2,27404,KG,B,,True,134923,762230099043,CHICLES,<p>TRIDENT 6S SANDIA 9GR</p>,https://locahost:8000/images/762230099043?heig...,HARDLINES,AUTOMOTIVE,APARIENCIA AUTOMOVIL,SPURA
12,9510,KG,,,True,686777,780413300603,HUEVO TRADICIONAL,<p>H.PRIMERA COLOR.12 12UN</p>,https://locahost:8000/images/780413300603?heig...,HOME,BEDDING,CAMA ACCESORIOS,IMPRECOO
21,191131,KG,B,,False,938864,7501198350944,BEBIDA TE LIMON,<p>BEBIDA TE LIMON 473 ML.</p>,https://locahost:8000/images/7501198350944?hei...,BABY WORLD,INFANT SOFTLINES,BOTTOM,CASA LOBO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321532,141957,,CD,,False,215603,8033737232095,PIMIENTA NEGRA,<p>PIMIENTA NEGRA 80 GRS</p>,https://locahost:8000/images/8033737232095?hei...,HARDLINES,LAWN & GARDEN,MACETEROS,DELIA
321533,57936,UN,B,,False,321437,40000680310,BERMUDA JEANS DENIM,<p>BERMUDA LINEA 2 1UN</p>,https://locahost:8000/images/40000680310?heigh...,GROCERY,DAIRY PRODUCTS,CREAM,SAN JOSE
321536,127410,UN,CD,,True,285494,2502200000005,QUESO MOSARELLA ADOBADO,<p>QUESO MOSARELLA ADOBADO 0 NDF</p>,https://locahost:8000/images/2502200000005?hei...,APPAREL,SOCKS,LADY - CALCETINES,MY LITTLE BOUTIQUE
321540,312844,KG,B,,True,305744,6971165068775,"CAMISA JORDACHE J, BLANCO, EG/XL","<p>CAMISA JORDACHE J, BLANCO, EG/XL</p>",https://locahost:8000/images/6971165068775?hei...,GROCERY,92 GROCERY,MILK-CREAM-SUBSTITUTES-SHELF,MEYER


In [24]:
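# BUY_UNIT and DESCRIPTION_STATUS were cast to 'category' in an earlier cell; a categorical
# cannot be filled with a label it does not contain, so cast back to object before filling.
Productdata['BUY_UNIT'] = Productdata['BUY_UNIT'].astype(object).fillna('N/A')
Productdata['DESCRIPTION_STATUS'] = Productdata['DESCRIPTION_STATUS'].astype(object).fillna('N/A')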
Productdata['ORGANIC_ITEM'] = Productdata['ORGANIC_ITEM'].fillna('N/A')

In [25]:
Productdata.isnull().sum()

SKU                      0
BUY_UNIT                 0
DESCRIPTION_STATUS       0
ORGANIC_ITEM             0
KIRLAND_ITEM             0
FINELINE_NUMBER          0
BARCODES                 0
NAME                    12
DESCRIPTION              0
IMAGE_URL                0
CATEGORY              1476
SUB_CATEGORY          1476
SUB_SUB_CATEGORY      1476
BRAND                  132
dtype: int64

In [27]:
Productdata['DESCRIPTION'] = Productdata['DESCRIPTION'].replace({'<p>':'','</p>':''},regex = True)
Productdata

Unnamed: 0,SKU,BUY_UNIT,DESCRIPTION_STATUS,ORGANIC_ITEM,KIRLAND_ITEM,FINELINE_NUMBER,BARCODES,NAME,DESCRIPTION,IMAGE_URL,CATEGORY,SUB_CATEGORY,SUB_SUB_CATEGORY,BRAND
0,52572,UN,B,,False,713115,693177767936,CANASTOS,CANASTO CONEJO F1 A 1UN,https://locahost:8000/images/693177767936?heig...,APPAREL,MENS WEAR,MEN - T-SHIRTS,PUPPY DOG PALS
1,278100,,R,,True,345827,399,LIMÓN COLIMA KG,LIMÓN COLIMA KG,https://locahost:8000/images/399?height=500&wi...,ENTERTAINMENT,TOYS & HOBBIES,PELUCHES,ENTRECOT
2,27404,KG,B,,True,134923,762230099043,CHICLES,TRIDENT 6S SANDIA 9GR,https://locahost:8000/images/762230099043?heig...,HARDLINES,AUTOMOTIVE,APARIENCIA AUTOMOVIL,SPURA
3,215143,UN,CD,Y,True,773663,7501199416236,JABON LIQUIDO PARA MANOS VITAMINA E,JABON LIQUIDO PARA MANOS VITAMINA E 442 ML.,https://locahost:8000/images/7501199416236?hei...,HARDLINES,SEASONAL,SEASONAL - EASTER,CHAMYTO
4,85805,KG,CD,N,False,233313,8410113003119,VINO TINTO GRAN CORONAS TORRES 750,VINO TINTO GRAN CORONAS TORRES 750 1 PZA,https://locahost:8000/images/8410113003119?hei...,CONSUMABLES,HOUSEHOLD CHEMICALS,AIR FRESHENER-DEODORIZER,POOPSIE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321537,37542,UN,,N,False,966232,698400300544,LOZA BASICA GRA,TAZA CONSOME 280CC 1UN,https://locahost:8000/images/698400300544?heig...,HARDLINES,HARDWARE,VENTILACION,HO DON CUSTODIO
321538,199851,,R,Y,True,855034,7501199404301,DETERGENTE LIQUIDO COLOR,DETERGENTE LIQUIDO COLOR 5 LT,https://locahost:8000/images/7501199404301?hei...,GROCERY,DAIRY PRODUCTS,BUTTER-SUBSTITUTES,MADEMSA
321539,144269,,R,Y,True,434058,7501006555752,2 SOBRES PALOMITAS + VALENTINA,2 SOBRES PALOMITAS + VALENTINA 1 PZA,https://locahost:8000/images/7501006555752?hei...,ENTERTAINMENT,ELECTRONICS,EQUIPOS DE AUDIO,DON JUAN
321540,312844,KG,B,,True,305744,6971165068775,"CAMISA JORDACHE J, BLANCO, EG/XL","CAMISA JORDACHE J, BLANCO, EG/XL",https://locahost:8000/images/6971165068775?hei...,GROCERY,92 GROCERY,MILK-CREAM-SUBSTITUTES-SHELF,MEYER
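
The replacement above removes only `<p>` and `</p>`. If other tags turn up in `DESCRIPTION`, a more general pattern may be useful. The following is a minimal sketch on three descriptions taken from the data; the assumption that any `<...>` sequence is markup and can be dropped would need to be checked against the full column.

```python
import pandas as pd

descriptions = pd.Series(
    [
        "<p>CANASTO CONEJO F1 A 1UN</p>",
        "<p>LIMÓN COLIMA KG </p>",
        "<p>TRIDENT 6S SANDIA 9GR</p>",
    ]
)

# Remove anything that looks like an HTML tag, then trim leftover whitespace
cleaned = descriptions.str.replace(r"<[^>]+>", "", regex=True).str.strip()

print(cleaned.tolist())
```

The trailing `.str.strip()` also takes care of stray spaces such as the one before `</p>` in the second description.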


In [31]:
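# Note: this overwrites CATEGORY in place, so running the cell a second time appends the sub-categories again.
Productdata['CATEGORY'] = Productdata['CATEGORY'].str.title()+ '| '+Productdata['SUB_CATEGORY'].str.title()+'| '+Productdata['SUB_SUB_CATEGORY'].str.title()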
set(Productdata['DESCRIPTION'])

{'JAMON DE PAVO 1 KG.',
 'CAFE CAPUCCINO 660 ML.',
 'THE LEGO MOVIE 2.-PS4 1 PZA',
 'TERMO 16 OZ ROSA 1 PZA',
 'CARPA CASTILLO GRIS 1UN',
 'DULCES MENTA CREMA SIN AZUCAR 75 GRS',
 'CEREAL TRIX 150 GRS',
 'BROGAL T ADULTO 120 ML.',
 'CAFE ORGANICO 500 GRS',
 'MANTEL DE PLÁSTICO PALA PARY MORADO 1 P',
 'DURAZNO REBANO EN ALMIBAR 820 GRS',
 'CHOC ALMENDRAS S/AZ 75GR',
 'TRESEMME CO OIL RADIANTE 12X750ML 750 ML.',
 'PL&AACUTE;STICO PARA ENVOLVER 1 PZA',
 'MUÑECA FAMOSA JAGGETS 700013785',
 'VJGO PRO EVOLUTION SOCCER 2017 PS3 KONAM',
 'TAGLIATELLE MOLTO BENNE 250 GRS',
 'YOGHURT MANGO CEREAL 130 GRS',
 'PRESONE 5MG. C 10 TABS 1 PZA',
 'MOUSSE RIZOS PERFECTOS 250 GRS',
 'MIEL DE MANTEQUILLA 300 ML.',
 'CONJUNTOS BA 3-6M 1UN',
 'CREMA CORPORAL CUERPO  UV 25% GRATIS 50M 500 ML.',
 'PAPAS  CREMA CEBOLLA 149 GRS',
 'ENDULZANTE SELECTO STEVIA 300 GRS',
 'ENTEL PLAN BAM 80MB 1UN',
 'QUESO TORTA DEL CASAR DONA ENGRACIA 400 GRS',
 'BEBIDA PREMIUM CHINOTTO 200 ML.',
 'MUCOCEF 500 MG/8.7 MG 21 CAPS 1 

In [33]:
# Derive the package unit from DESCRIPTION; the first matching condition wins
Productdata['Package'] = np.select(
    [
        Productdata['DESCRIPTION'].str.contains("UN"),
        Productdata['DESCRIPTION'].str.contains("KG"),
        Productdata['DESCRIPTION'].str.contains("LT"),
        Productdata['DESCRIPTION'].str.contains("PZA"),
        Productdata['DESCRIPTION'].str.contains("ML"),
        Productdata['DESCRIPTION'].str.contains("GRS"),
    ],
    ['UN', 'KG', 'LT', 'PZA', 'ML', 'GRS'],
    default='N/A',
)
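
Because `str.contains` matches substrings anywhere in the text, a unit code such as `UN` can also match inside an ordinary word. A stricter variant is sketched below: it accepts a unit only when it is not preceded by a letter and is followed by a word boundary. The `pattern` and the last sample description are hypothetical and chosen for illustration; the other descriptions come from the data.

```python
import pandas as pd

descriptions = pd.Series(
    [
        "CANASTO CONEJO F1 A 1UN",
        "LIMÓN COLIMA KG",
        "TRIDENT 6S SANDIA 9GR",
        "JABON LIQUIDO PARA MANOS VITAMINA E 442 ML.",
        "ATUN EN AGUA 140 GRS",  # hypothetical row: 'UN' occurs inside a word
    ]
)

UNITS = ["UN", "KG", "LT", "PZA", "ML", "GRS"]

# (?<![^\W\d_]) rejects a match that directly follows a letter (digits such as
# the '1' in '1UN' are allowed); \b requires the unit to end at a word boundary.
pattern = r"(?<![^\W\d_])(" + "|".join(UNITS) + r")\b"

package = descriptions.str.extract(pattern, expand=False).fillna("N/A")

print(package.tolist())
```

Unlike the `np.select` version, this returns the first unit by position in the text rather than by priority in the list, so the two approaches can disagree when a description contains several units.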

In [35]:
del Productdata['SUB_CATEGORY']
del Productdata['SUB_SUB_CATEGORY']

In [36]:
Productdata

Unnamed: 0,SKU,BUY_UNIT,DESCRIPTION_STATUS,ORGANIC_ITEM,KIRLAND_ITEM,FINELINE_NUMBER,BARCODES,NAME,DESCRIPTION,IMAGE_URL,CATEGORY,BRAND,Package
0,52572,UN,B,,False,713115,693177767936,CANASTOS,CANASTO CONEJO F1 A 1UN,https://locahost:8000/images/693177767936?heig...,Apparel| Mens Wear| Men - T-Shirts| Mens Wear|...,PUPPY DOG PALS,UN
1,278100,,R,,True,345827,399,LIMÓN COLIMA KG,LIMÓN COLIMA KG,https://locahost:8000/images/399?height=500&wi...,Entertainment| Toys & Hobbies| Peluches| Toys ...,ENTRECOT,KG
2,27404,KG,B,,True,134923,762230099043,CHICLES,TRIDENT 6S SANDIA 9GR,https://locahost:8000/images/762230099043?heig...,Hardlines| Automotive| Apariencia Automovil| A...,SPURA,
3,215143,UN,CD,Y,True,773663,7501199416236,JABON LIQUIDO PARA MANOS VITAMINA E,JABON LIQUIDO PARA MANOS VITAMINA E 442 ML.,https://locahost:8000/images/7501199416236?hei...,Hardlines| Seasonal| Seasonal - Easter| Season...,CHAMYTO,ML
4,85805,KG,CD,N,False,233313,8410113003119,VINO TINTO GRAN CORONAS TORRES 750,VINO TINTO GRAN CORONAS TORRES 750 1 PZA,https://locahost:8000/images/8410113003119?hei...,Consumables| Household Chemicals| Air Freshene...,POOPSIE,PZA
...,...,...,...,...,...,...,...,...,...,...,...,...,...
321537,37542,UN,,N,False,966232,698400300544,LOZA BASICA GRA,TAZA CONSOME 280CC 1UN,https://locahost:8000/images/698400300544?heig...,Hardlines| Hardware| Ventilacion| Hardware| Ve...,HO DON CUSTODIO,UN
321538,199851,,R,Y,True,855034,7501199404301,DETERGENTE LIQUIDO COLOR,DETERGENTE LIQUIDO COLOR 5 LT,https://locahost:8000/images/7501199404301?hei...,Grocery| Dairy Products| Butter-Substitutes| D...,MADEMSA,LT
321539,144269,,R,Y,True,434058,7501006555752,2 SOBRES PALOMITAS + VALENTINA,2 SOBRES PALOMITAS + VALENTINA 1 PZA,https://locahost:8000/images/7501006555752?hei...,Entertainment| Electronics| Equipos De Audio| ...,DON JUAN,PZA
321540,312844,KG,B,,True,305744,6971165068775,"CAMISA JORDACHE J, BLANCO, EG/XL","CAMISA JORDACHE J, BLANCO, EG/XL",https://locahost:8000/images/6971165068775?hei...,Grocery| 92 Grocery| Milk-Cream-Substitutes-Sh...,MEYER,
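
The `CATEGORY` values in this output repeat the sub-category (for example `Apparel| Mens Wear| Men - T-Shirts| Mens Wear|...`), which is consistent with the concatenation cell having been run more than once. One way to make the step safe to re-run is to write the combined value to a separate column, so the inputs are never overwritten. The sketch below assumes a hypothetical column name `CATEGORY_PATH` and a hypothetical helper `add_category_path`, and it would need to run before the sub-category columns are deleted.

```python
import pandas as pd

products = pd.DataFrame(
    {
        "CATEGORY": ["APPAREL", "HARDLINES"],
        "SUB_CATEGORY": ["MENS WEAR", "AUTOMOTIVE"],
        "SUB_SUB_CATEGORY": ["MEN - T-SHIRTS", "APARIENCIA AUTOMOVIL"],
    }
)


def add_category_path(df):
    """Return a copy with a combined, title-cased CATEGORY_PATH column."""
    out = df.copy()
    levels = ["CATEGORY", "SUB_CATEGORY", "SUB_SUB_CATEGORY"]
    parts = [out[level].str.title() for level in levels]
    # The source columns are left untouched, so re-running gives the same result
    out["CATEGORY_PATH"] = parts[0] + "| " + parts[1] + "| " + parts[2]
    return out


once = add_category_path(products)
twice = add_category_path(once)

print(once["CATEGORY_PATH"].tolist())
```

Rows with a missing category level still produce a missing `CATEGORY_PATH`, as in the original concatenation.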


## Findings

- __Branch__ information is not available for 19.96% of the records. After analysis, we decided to remove those records because they did not add any significant insight to the data. Sample rows have been attached to the email.
- __BUY_UNIT__ is not available for 33.30% of the records. We have marked those records as Not Available (N/A).
- __DESCRIPTION_STATUS__ is not available for 20.07% of the records. We have marked those records as Not Available (N/A).
- __ORGANIC_ITEM__ is not available for 33.33% of the records. We have marked those records as Not Available (N/A).
- We were able to extract __Package__ information such as _'UN'_, _'KG'_, _'LT'_, _'PZA'_, _'ML'_ and _'GRS'_ from the description. Please provide the full list of package types so that we can clean the data and add the remaining packages.
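
For the client report, the missing-value figures quoted in these findings can be collected into a single table. The sketch below is one possible layout, run on a small hypothetical frame; the helper name `missing_report` and its column names are assumptions, and it would have to run before the missing values are filled with `'N/A'`.

```python
import pandas as pd


def missing_report(df):
    """Summarise missing values per column as a count and a percentage."""
    report = pd.DataFrame(
        {
            "missing_count": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(2),
        }
    )
    # Keep only columns with gaps, worst first
    affected = report[report["missing_count"] > 0]
    return affected.sort_values("missing_pct", ascending=False)


# Hypothetical sample; the gaps are placed for illustration only
sample = pd.DataFrame(
    {
        "SKU": [52572, 278100, 27404, 215143],
        "BUY_UNIT": ["UN", None, "KG", "UN"],
        "ORGANIC_ITEM": [None, None, None, "Y"],
    }
)

report = missing_report(sample)
print(report)
```

The resulting frame can be exported with `report.to_csv(...)` or pasted into the email alongside the sample rows.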