# Searching for the category

For this code-along we are only going to use the products DataFrame. However, if you believe there is information in other tables that can help create categories, please feel free to explore.

In [1]:
import pandas as pd

In [2]:
# products_cl.csv
url = "https://drive.google.com/file/d/1s7Lai4NSlsYjGEPg1QSOUJobNYVsZBOJ/view?usp=sharing"
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
products_cl = pd.read_csv(path)
#products_cl = pd.read_csv('products_cl.csv')
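
The `path` line in the cell above turns a Google Drive sharing link into a direct-download link by pulling the file ID out of the URL. A minimal sketch of what those string operations produce, using the same URL (the intermediate names `parts` and `file_id` are added here for illustration):

```python
# Same sharing URL as in the cell above
url = "https://drive.google.com/file/d/1s7Lai4NSlsYjGEPg1QSOUJobNYVsZBOJ/view?usp=sharing"

# Splitting on "/" gives 7 pieces:
# ['https:', '', 'drive.google.com', 'file', 'd', '<file id>', 'view?usp=sharing']
parts = url.split("/")

# The file ID is the second-to-last piece
file_id = parts[-2]

# Prepend the direct-download prefix used in the cell above
path = "https://drive.google.com/uc?export=download&id=" + file_id
print(file_id)
print(path)
```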

In [3]:
product_category_df = products_cl.copy()

In [4]:
product_category_df.head()

Unnamed: 0,sku,name,desc,price,in_stock,type
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,1,8696
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,0,13855401
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,0,1387
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,0,1230
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,1,1364


## 1.&nbsp; Category creation by search term
Let's start by creating a column `category`. For now we'll fill this column with a blank string `""`.

In [5]:
product_category_df["category"] = ""
product_category_df.head()

Unnamed: 0,sku,name,desc,price,in_stock,type,category
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,1,8696,
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,0,13855401,
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,0,1387,
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,0,1230,
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,1,1364,


We can find all the products with certain words in their description (the `desc` column) using `.loc[]` and `.str.contains()`. Here we'll look at all the items that have the word `keyboard` in their description.

In [6]:
product_category_df.loc[product_category_df["desc"].str.contains("keyboard", case=False)]

Unnamed: 0,sku,name,desc,price,in_stock,type,category
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.00,0,13855401,
15,MOS0021,Clearguard Moshi MacBook Pro and Air,Keyboard Protector MacBook Pro 13-inch Retina ...,24.95,0,13835403,
24,APP0277,Apple Wireless Keyboard Keyboard (OEM) Mac,Ultrathin keyboard Apple Bluetooth Spanish (un...,79.00,0,13855401,
64,HGD0012,Henge Docks Click keyboard support iMac,Base to hold the Apple Magic TrackPad and Wire...,29.00,0,8696,
365,LOG0084,Logitech Ultrathin Keyboard Cover Keyboard Cov...,Ultrathin cover and cover with Bluetooth keybo...,89.99,0,12575403,
...,...,...,...,...,...,...,...
9720,PAC2508,Replacement Magic Wireless Keyboard by Matias ...,Keyboard replacement service at the time of pu...,119.99,1,13855401,
9751,MTF0008,Mistify Clean Screens Natural 500ml.,Spray cleaning screens and keyboards.,14.99,1,12085400,
9796,ZAG0026-A,Open - Zagg Rugged Keyboard Folio iPad Messeng...,Case reconditioned keyboard and adjustable pos...,99.99,0,12575403,
9932,APP1472,Apple Magic Keyboard English International,English keyboard Mac and Apple iPad Ultrathin ...,119.00,1,13855401,
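
To see what happens inside `.loc[]`, here is a small sketch on a made-up DataFrame (the three rows in `toy_df` are invented for illustration and do not come from `products_cl`). `.str.contains()` returns one `True`/`False` value per row, and `.loc[]` keeps the rows where that value is `True`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "name": ["Magic Keyboard", "Screen cleaner", "USB mouse"],
    "desc": ["Ultrathin keyboard.", "Spray for screens and KEYBOARDS.", "Wired mouse."],
})

# One True/False value per row; case=False ignores upper/lower case
mask = toy_df["desc"].str.contains("keyboard", case=False)
print(mask.tolist())

# .loc[] keeps only the rows where the mask is True
keyboard_rows = toy_df.loc[mask]
print(keyboard_rows["name"].tolist())
```

Note that the screen cleaner is selected as well, because its description mentions keyboards. The same thing happens with the Mistify cleaning spray in the real output above.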


Next, we change the value in the category column to `keyboard` for all of these keyboard products.

In [7]:
product_category_df.loc[product_category_df["desc"].str.contains("keyboard", case=False), "category"] = "keyboard"

Let's take a look at the effect that had on the `category` column.

In [8]:
product_category_df["category"].value_counts()

category
            9903
keyboard      89
Name: count, dtype: int64

## 2.&nbsp; Category creation using regex
We can also use a product's `name` to select products for our categories.

In [9]:
product_category_df.loc[product_category_df["name"].str.contains("apple iphone", case=False)]

Unnamed: 0,sku,name,desc,price,in_stock,type,category
35,APP0308,AV Cable Adapter Apple iPhone iPad and iPod white,IPhone iPad iPod adapter and AV cable.,45.00,0,1230,
214,REP0100,Color change to White Apple iPhone 4,It is including parts and labor..,94.21,0,"1,44E+11",
215,REP0052,Color change to White Apple iPhone 4,It is including parts and labor..,94.21,0,"1,44E+11",
579,APP0675,Apple iPhone 5S 32GB Space Gray,New Free iPhone 5S 32GB (ME435Y / A).,559.00,0,,
956,APP0823,Apple iPhone 6 16GB Silver,New iPhone 6 16GB Free (MG482QL / A).,639.00,0,,
...,...,...,...,...,...,...,...
9790,AP20455,Like new - Apple iPhone 8 256GB Gold,Apple iPhone 8 reconditioned 256GB in Gold rea...,979.00,0,113291716,
9794,APP2482-A,Open - Apple iPhone 8 Plus 256GB Gold,Refurbished Apple iPhone 8 Plus 256GB Free Gold,1089.00,0,113281716,
9929,APP2477-A,Open - Apple iPhone 8 Plus 64GB Space Gray,Apple iPhone 8 Plus 64GB Space Gray,919.00,0,113281716,
9958,AP20467,Like new - Apple iPhone Silicone Case Cover 7 ...,Reconditioned silicone sleeve microfiber Apple...,45.00,0,11865403,


Looks like this search also returns a lot of accessories. We can refine it with a little regex. Here, we add `^.{0,7}` at the beginning of the search term: `^` anchors the match to the start of the name and `.{0,7}` allows between 0 and 7 characters of any kind, so we only find names where at most 7 characters precede "apple iphone". If 8 or more characters precede the search term, the product won't be found. This refines our search by using the naming conventions of the DataFrame to our advantage.

If you feel unsure about regex, please use [regex101](https://regex101.com/). It's really useful for checking your code, and parts of other people's code that you're unsure about.
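
To get a feel for the character limit before running it on the DataFrame, here is a small sketch using Python's built-in `re` module, which follows the same regex rules as `.str.contains()`. The four names are copied from the search results above; the variable names are only for illustration.

```python
import re

# Product names copied from the search results above
names = [
    "Apple iPhone 6 16GB Silver",
    "Open - Apple iPhone 8 Plus 256GB Gold",
    "Like new - Apple iPhone 8 256GB Gold",
    "AV Cable Adapter Apple iPhone iPad and iPod white",
]

# ^ anchors at the start of the name;
# .{0,7} allows 0 to 7 characters of any kind before the search term
pattern = re.compile(r"^.{0,7}apple iphone", flags=re.IGNORECASE)

for name in names:
    # Number of characters that come before "apple iphone" in this name
    prefix_length = name.lower().find("apple iphone")
    matched = pattern.search(name) is not None
    print(f"{prefix_length:>2} characters before the term, match={matched}: {name}")

# Names that the pattern would select
matches = [name for name in names if pattern.search(name)]
```

The prefix "Open - " is exactly 7 characters long, so those products are found, while "Like new - " is 11 characters long, so those products are not. Widen the range if you want to include them.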

In [10]:
product_category_df.loc[product_category_df["name"].str.contains("^.{0,7}apple iphone", case=False)]

Unnamed: 0,sku,name,desc,price,in_stock,type,category
579,APP0675,Apple iPhone 5S 32GB Space Gray,New Free iPhone 5S 32GB (ME435Y / A).,559.0,0,,
956,APP0823,Apple iPhone 6 16GB Silver,New iPhone 6 16GB Free (MG482QL / A).,639.0,0,,
961,APP0829,Apple iPhone 6 Plus 16GB Silver,New iPhone 6 Plus 16G Free (MGA92QL / A).,749.0,0,,
962,APP0822,Apple iPhone 6 16GB Space Gray,New iPhone 6 16GB Free (MG472QL / A).,639.0,0,,
963,APP0825,Apple iPhone 6 64GB Space Gray,New iPhone 6 64GB Free (MG4F2QL / A).,749.0,0,,
...,...,...,...,...,...,...,...
9585,APP1634-A,Open - Apple iPhone 7 Plus 32GB Black,New 32GB Apple iPhone 7 Plus Free Black,779.0,0,85651716,
9587,APP2540-A,Open - Apple iPhone Leather Folio X Baya,Leather case with box and official cover Apple,109.0,0,11865403,
9714,APP2562-A,Open - Apple iPhone Leather Case Cover Red,Reconditioned skin sheath official Apple desig...,45.0,0,11865403,
9794,APP2482-A,Open - Apple iPhone 8 Plus 256GB Gold,Refurbished Apple iPhone 8 Plus 256GB Free Gold,1089.0,0,113281716,


Now we can use the same trick as before to set the category - selecting the `category` column and setting it to the string of our choice.

In [11]:
product_category_df.loc[product_category_df["name"].str.contains("^.{0,7}apple iphone", case=False), "category"] = "smartphone"

In [12]:
product_category_df["category"].value_counts()

category
              9634
smartphone     269
keyboard        89
Name: count, dtype: int64

## 3.&nbsp; One product with multiple categories
A product may fit into multiple categories. To help us assign multiple categories to one product, we will use Python's addition assignment operator `+=`. It is shorthand for adding something (a number, a string, etc.) to a variable and storing the result back in that same variable.

Let's have a look at a couple of examples.

In [13]:
a = 10
a = a + 5
a

15

In [14]:
a = 10
a += 5
a

15

In [15]:
b = "Tyrannosaurus"
b = b + " rex"
b

'Tyrannosaurus rex'

In [16]:
b = "Tyrannosaurus"
b += " rex"
b

'Tyrannosaurus rex'

Now let's look at how this can help us in our category creation.

First, we'll reset all the values in the category column to an empty string `""`.

In [17]:
product_category_df["category"] = ""

Now, let's create some categories and utilise the addition assignment.

In [18]:
product_category_df.loc[product_category_df["desc"].str.contains("keyboard", case=False), "category"] += ", keyboard"
product_category_df.loc[product_category_df["name"].str.contains("^.{0,7}apple iphone", case=False), "category"] += ", smartphone"
product_category_df.loc[product_category_df["name"].str.contains("^.{0,7}apple ipod", case=False), "category"] += ", ipod"
product_category_df.loc[product_category_df["name"].str.contains("^.{0,7}apple ipad|tablet", case=False), "category"] += ", tablet"
product_category_df.loc[product_category_df["name"].str.contains("imac|mac mini|mac pro", case=False), "category"] += ", desktop"

In [19]:
product_category_df["category"].value_counts()

category
                       8362
, desktop               923
, tablet                307
, smartphone            269
, keyboard               83
, ipod                   42
, keyboard, tablet        4
, keyboard, desktop       2
Name: count, dtype: int64

As you can see, some products now have 2 categories instead of just one. At the end, you can use your string skills to tidy up the leading comma and space in the `category` column.
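
One possible way to do that tidy-up, sketched on a small stand-alone Series (`category` here is a made-up stand-in for the real column, with values shaped like the output above):

```python
import pandas as pd

# Example values shaped like the `category` column above
category = pd.Series(["", ", desktop", ", keyboard, tablet"])

# Remove a comma and space only when they sit at the very start of the string (^)
tidy = category.str.replace(r"^, ", "", regex=True)
print(tidy.tolist())
```

Anchoring with `^` matters here: without it, the separator between two categories would be removed as well.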

# Challenge. Your categories
Now it's your turn. We'll reset the DataFrame so that no categories exist, and it's up to you to create the categories based on keywords in the name and description. Feel free to go wild and make as many categories as you like.
* Remember you can also use regex to refine your searches.
* Remember you can use the or operator `|` to search for multiple terms at once.
* Remember to tidy up any untidy strings at the end.
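
As a reminder of how `|` behaves inside a pattern, here is a small sketch using Python's built-in `re` module, which follows the same regex rules as `.str.contains()`. The first two names come from the outputs earlier in this notebook; "Generic tablet stand" is invented for illustration.

```python
import re

names = [
    "Apple Mac Keyboard Keypad Spanish",
    "Mighty Mouse Apple Mouse for Mac",
    "Generic tablet stand",  # invented name, for illustration only
]

# "keyboard|keypad" matches a name containing either word
either = re.compile(r"keyboard|keypad", flags=re.IGNORECASE)
print([bool(either.search(n)) for n in names])

# | has the lowest precedence, so this reads as (^.{0,7}apple ipad) OR (tablet):
# the ^.{0,7} prefix applies to the first alternative only
ungrouped = re.compile(r"^.{0,7}apple ipad|tablet", flags=re.IGNORECASE)

# A non-capturing group (?:...) applies the prefix to both alternatives
grouped = re.compile(r"^.{0,7}(?:apple ipad|tablet)", flags=re.IGNORECASE)

name = "Generic tablet stand"
print(bool(ungrouped.search(name)), bool(grouped.search(name)))
```

Whether you want the grouped or ungrouped behaviour depends on the category; the point is only to be aware of which one you are writing.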

In [20]:
# products_cl.csv
url = "https://drive.google.com/file/d/1s7Lai4NSlsYjGEPg1QSOUJobNYVsZBOJ/view?usp=sharing"
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
products_cl = pd.read_csv(path)
#products_cl = pd.read_csv('Vasil_products_cl.csv')

prod_df = products_cl.copy()

In [21]:
# your code here
prod_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9992 entries, 0 to 9991
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   sku       9992 non-null   object 
 1   name      9992 non-null   object 
 2   desc      9992 non-null   object 
 3   price     9992 non-null   float64
 4   in_stock  9992 non-null   int64  
 5   type      9946 non-null   object 
dtypes: float64(1), int64(1), object(4)
memory usage: 468.5+ KB


In [22]:
prod_df['category'] = ''

In [23]:
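# Note: this cell uses `=` and no trailing ', ', unlike the `+=` cells below,
# so a later `+=` on these rows appends directly after 'support'
prod_df.loc[prod_df.desc.str.contains('support', case=False), 'category'] = 'support'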
prod_df.loc[prod_df.desc.str.contains('support', case=False), 'category'].count()

261

In [24]:
prod_df.loc[prod_df.desc.str.contains('keyboard|keypad', case=False) | prod_df.name.str.contains('keyboard|keypad', case=False), 'category'] += 'keyboard, '
prod_df.loc[prod_df.desc.str.contains('keyboard|keypad', case=False) | prod_df.name.str.contains('keyboard|keypad', case=False), 'category'].count()

108

In [25]:
prod_df.loc[prod_df.desc.str.contains('backpack', case=False) | prod_df.name.str.contains('backpack', case=False), 'category'] += 'backpack, '
prod_df.loc[prod_df.desc.str.contains('backpack', case=False) | prod_df.name.str.contains('backpack', case=False), 'category'].count()

60

In [26]:
prod_df.loc[prod_df.desc.str.contains('battery', case=False) | prod_df.name.str.contains('battery', case=False), 'category'] += 'battery, '
prod_df.loc[prod_df.desc.str.contains('battery', case=False) | prod_df.name.str.contains('battery', case=False), 'category'].count()

324

In [27]:
prod_df.loc[prod_df.desc.str.contains('headset|headphone|earphone', case=False) | prod_df.name.str.contains('headset|headphone|earphone', case=False), 'category'] += 'headphone, '
prod_df.loc[prod_df.desc.str.contains('headset|headphone|earphone', case=False) | prod_df.name.str.contains('headset|headphone|earphone', case=False), 'category'].count()

210
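
The cells above build the same mask twice, once to assign and once to count. A possible way to cut the repetition is a small helper. This is only a sketch: `add_category` is a hypothetical function name, not something pandas provides, and `toy_df` is invented for illustration.

```python
import pandas as pd

def add_category(df, pattern, label):
    """Append `label` to `category` where `pattern` is found in name or desc.

    Returns the number of rows that were labelled.
    """
    # Build the mask once: pattern found in the name OR in the description
    mask = (
        df["name"].str.contains(pattern, case=False)
        | df["desc"].str.contains(pattern, case=False)
    )
    df.loc[mask, "category"] += label + ", "
    return int(mask.sum())

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "name": ["Wireless Keyboard", "Travel Backpack", "Keyboard Backpack Bundle", "USB Cable"],
    "desc": ["Bluetooth keypad.", "Laptop backpack 15L.", "Bag with keyboard sleeve.", "Charging cable."],
})
toy_df["category"] = ""

n_keyboard = add_category(toy_df, "keyboard|keypad", "keyboard")
n_backpack = add_category(toy_df, "backpack", "backpack")
print(n_keyboard, n_backpack)
print(toy_df["category"].tolist())
```

Because every label is appended with the same trailing `", "`, one clean-up step at the end is enough.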

In [28]:
#prod_df.loc[prod_df.desc.str.contains('headset|headphone|earphone', case=False) | prod_df.name.str.contains('headset|headphone|earphone', case=False)]

### Clean up the `category` strings

In [29]:
prod_df.loc[prod_df.category.str.contains(', $', regex=True), 'category']

1        keyboard, 
11      headphone, 
15       keyboard, 
24       keyboard, 
25        battery, 
           ...     
9985     backpack, 
9988     backpack, 
9989     backpack, 
9990     backpack, 
9991     backpack, 
Name: category, Length: 676, dtype: object

In [30]:
prod_df.category = prod_df.category.str.replace(', $', '', regex=True)

In [31]:
prod_df.category

0        support
1       keyboard
2               
3               
4               
          ...   
9987            
9988    backpack
9989    backpack
9990    backpack
9991    backpack
Name: category, Length: 9992, dtype: object
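
Because one product can carry several labels in a single string, `value_counts()` on the column counts combinations rather than individual categories. A possible way to count per label, sketched on made-up values shaped like the cleaned column:

```python
import pandas as pd

# Example values shaped like the cleaned `category` column (invented for illustration)
category = pd.Series(["support", "keyboard", "", "keyboard, battery", "backpack"])

# Drop uncategorised rows, split multi-category strings into lists,
# then give each label its own row
labels = category[category != ""].str.split(", ").explode()

# Count how often each individual label occurs
counts = labels.value_counts()
print(counts)
```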

In [32]:
prod_df

Unnamed: 0,sku,name,desc,price,in_stock,type,category
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,1,8696,support
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.00,0,13855401,keyboard
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.00,0,1387,
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.00,0,1230,
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,1,1364,
...,...,...,...,...,...,...,...
9987,BEL0376,Belkin Travel Support Apple Watch Black,compact and portable stand vertically or horiz...,29.99,1,12282,
9988,THU0060,"Enroute Thule 14L Backpack MacBook 13 ""Black",Backpack with capacity of 14 liter compartment...,69.95,1,1392,backpack
9989,THU0061,"Enroute Thule 14L Backpack MacBook 13 ""Blue",Backpack with capacity of 14 liter compartment...,69.95,1,1392,backpack
9990,THU0062,"Enroute Thule 14L Backpack MacBook 13 ""Red",Backpack with capacity of 14 liter compartment...,69.95,0,1392,backpack


## 4.&nbsp; [BONUS] Using `type` to create categories
There is another possible way to create categories, but you'll have to explore this one on your own.

We have the mysterious column `type` in the `products` table. This could potentially be ready-made categories labelled with numbers instead of words. Let's investigate.

In [33]:
category_type_df = products_cl.copy()

Here are the `type`s that have the most products.

In [34]:
category_type_df.groupby("type").count().nlargest(10, "sku")

type,sku,name,desc,price,in_stock
11865403,1057,1057,1057,1057,1057
12175397,939,939,939,939,939
1298,783,783,783,783,783
11935397,562,562,562,562,562
11905404,454,454,454,454,454
1282,373,373,373,373,373
12635403,362,362,362,362,362
13835403,269,269,269,269,269
"5,74E+15",247,247,247,247,247
1364,216,216,216,216,216
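
If you only need the number of products per `type`, `value_counts()` gives the same ranking as a single Series instead of one repeated count per column. A minimal sketch on invented data (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only; None stands in for a missing type
toy_df = pd.DataFrame({
    "sku": ["A1", "A2", "A3", "B1", "B2", "C1", "D1"],
    "type": ["1298", "1298", "1298", "1364", "1364", "1282", None],
})

# Counts per type, largest first; rows with a missing type are left out by default
top_types = toy_df["type"].value_counts().head(2)
print(top_types)
```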


Let's have a look at the first `type` to see if we can make categories from this column.

In [35]:
category_type_df.loc[category_type_df["type"] == "11865403"].sample(10)

Unnamed: 0,sku,name,desc,price,in_stock,type
1162,OTT0055,Otterbox Defender Case iPhone 6 Gray,Ultra Rugged Case for iPhone 6.,49.99,0,11865403
5119,MUJ0024,Mujjo Leather Black Leather Case iPhone 7 Plus,ultrathin Case for iPhone vegetable tanned lea...,44.9,0,11865403
949,PUR0120,Puro Booklet Cover iPhone 6 Silver,Case closed book card case for iPhone 6 47 inc...,19.99,0,11865403
1528,LIF0046,LifeProof Fre Waterproof Case iPhone 6 Blue,waterproof and extreme conditions for iPhone 6...,79.99,0,11865403
4612,OTT0134,Otterbox iPhone Case Symmetry 2.0 SE / 5s / 5 ...,resistant cover and thin beveled edges for iPh...,34.99,0,11865403
5263,MOS0197,Moshi iGlaze Case Black iPhone 7/8,rigid case with compact design and shock prote...,30.0,1,11865403
943,MUV0141,Muvit Slim Folio iPhone 6 Pink,Cover with support function with lid 47 iPhone...,15.99,0,11865403
3049,TUC0237,Housing Sottile Tucano iPhone 6 / 6S Gray,Cover for iPhone 6 / 6S.,7.9,0,11865403
6619,BEL0273-A,Open - Air Protect SheerForce Case Belkin iPho...,Case technology against impact and wear resist...,24.99,0,11865403
8204,ELA0056,Elago S8 Empire Polycarbonate Case iPhone X Ro...,Protection best suited to your iPhone rose gol...,28.99,1,11865403


Looks like this is a category of phone cases.

Let's have a look at the 2nd largest type to see if that's also a clear category.

In [36]:
category_type_df.loc[category_type_df["type"] == "12175397"].sample(10)

Unnamed: 0,sku,name,desc,price,in_stock,type
3360,PAC1639,Pack QNAP TS-253A | 4G RAM | 6TB Seagate Desktop,QNAP Pack + 4GB memory RAM + 12TB (2x6TB) Seag...,949.97,1,12175397
3668,PAC1277,QNAP TS-451 Pack | Seagate Desktop 12TB,Pack QNAP TS-451 + 12TB (4x3TB) Seagate Hard D...,870.29,0,12175397
2927,PAC1160,Synology Pack DS216SE | Seagate 6TB IronWolf,Pack + Synology 6TB (2x3TB) Seagate Hard Drive...,419.97,1,12175397
3279,PAC1638,Pack QNAP TS-451 + | 8GB RAM | Seagate NAS 32TB,TS-451 + NAS with 40TB + 8GB RAM memory (4x10T...,2199.95,0,12175397
5968,QNA0194,QNAP TS-463U-RP NAS Server | 4GB RAM,NAS 4-bay rack format includes a port 10GbE tr...,1305.59,0,12175397
4382,PAC1425,Synology DS916 + Pack | 2GB RAM | WD 24TB Network,Synology DS916 + with 2GB RAM memory + 24TB (4...,1600.8,0,12175397
3365,QNA0156,QNAP TS-653A | 4GB RAM Mac and PC Server NAS,NAS server 4 bays and 4 GB RAM for small busin...,845.79,0,12175397
4389,PAC1853,Synology DS916 + | 8GB RAM | 32TB (4x8TB) Seag...,DS916 + NAS with 32TB + 8GB RAM memory (4x8TB)...,2069.16,0,12175397
3381,PAC1342,Pack QNAP TS-453A | 8GB RAM | WD 16TB Network,Pack NAS QNAP TS-453A with 8GB of RAM memory +...,1376.59,0,12175397
8664,PAC2451,Synology DS918 + NAS Server | 16GB RAM | 32TB ...,NAS server of the Plus Series for companies se...,2212.67,0,12175397


Looks like this category is full of servers.

I wonder how many `type`s account for most of our products?

In [37]:
n = 30
top_n_count = category_type_df.groupby("type").count().nlargest(n, "sku")["sku"].sum()
share = (top_n_count / category_type_df.shape[0] * 100).round(2)
print(f"With the {n} largest types, we account for {share}% of all products.")

With the 30 largest types, we account for 78.4% of all products.
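
To compare several values of `n` at once, a running total of the shares can help. This is a sketch on an invented `types` Series, not on the real data:

```python
import pandas as pd

# Toy type column, invented for illustration only; None stands in for a missing type
types = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] + [None])

# Share of all rows (including rows with a missing type) per type, largest first
share = types.value_counts() / len(types)

# Running total: entry k is the percentage covered by the k+1 largest types
coverage = (share.cumsum() * 100).round(1)
print(coverage.tolist())
```

Reading down `coverage` shows how quickly adding more types stops paying off.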


Looks like we can simply investigate the 30 largest types and set their categories; the remaining 22% or so of products can then get the category `other`.

Use the skills you learnt above to change the category for each type.
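
If you end up with many types to label, a lookup dictionary can be less repetitive than one `.loc[]` line per type. This is a sketch of an alternative approach, not the method used in the cells that follow. The two labels come from the inspection above; `type_to_category` and `toy_df` are names invented for illustration.

```python
import pandas as pd

# Labels for two type codes, taken from the inspection above;
# extend the dict as you investigate more types
type_to_category = {
    "11865403": "phone cases",
    "12175397": "servers",
}

# Toy `type` column, invented for illustration only
# ("1298" is not in the dict, None is a missing type)
toy_df = pd.DataFrame({"type": ["11865403", "12175397", "1298", None]})

# map() looks each type up in the dict; anything not found becomes NaN,
# which fillna() then turns into "other"
toy_df["category"] = toy_df["type"].map(type_to_category).fillna("other")
print(toy_df["category"].tolist())
```

With this approach each product gets exactly one label, so there is no trailing separator to clean up afterwards.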

## Here I build the categories from the 10 types with the most products

In [38]:
category_type_df['category'] = ''

In [39]:
category_type_df.loc[category_type_df["type"] == "11865403", 'category'] = 'phone cases, '
category_type_df.loc[category_type_df["type"] == "11865403", 'category'].count()

1057

In [40]:
category_type_df.loc[category_type_df["type"] == "12175397", 'category'] += 'servers, '
category_type_df.loc[category_type_df["type"] == "12175397", 'category'].count()

939

In [41]:
pd.set_option("display.max_colwidth", 100)

In [42]:
# Inspect type 1298 before deciding on a category (none is assigned here)
category_type_df.loc[category_type_df["type"] == "1298"]

Unnamed: 0,sku,name,desc,price,in_stock,type,category
453,BEL0127-A,Open - Belkin MIXIT Lightning iPhone Support,Loading dock + support synchronization with Lightning connection and USB cable for iPhone 7 / SE...,34.99,0,1298,
1071,SEA0043-A,"Open - Seagate Barracuda 1TB 35 ""SATA 7200rpm hard drive Mac and PC",internal hard drive for Mac and PC Refurbished 1TB (ST1000DM003),59.00,0,1298,
1445,WDT0175-A,"Open - Western Digital 2TB Green 35 ""5400rpm hard drive Mac and PC",WD Internal Hard Drive 2TB Mac and PC.,90.00,0,1298,
1499,NTE0056-A,"Open - NewerTech NuPower Battery 65W MacBook Pro 13 ""2009/14",MacBook Pro 13 inch Battery 2009/14,131.99,0,1298,
1511,BNQ0018-A,"(Open) LED Monitor BenQ VW2235H 215 """,Monitor 215 inch high range.,163.00,0,1298,
...,...,...,...,...,...,...,...
8670,LGE0061-A,"Open - LG 43UD79-B Monitor 425 ""4K 72% NTSC USB-C Speakers DisplayPort",425 inch monitor 5ms response DisplayPort and USB 4 HDMI connections NTSC-C 72% for Mac and PC,845.99,0,1298,
8671,QNA0190-A,Open - QNAP TS-853U NAS Server | 4GB RAM,8-bay NAS server for small businesses with 4 Ethernet ports for Mac and PC,1511.29,0,1298,
8879,KAN0034-A,"Open - Kanex USB-C Gigabit Ethernet Adapter MacBook 12 ""","Open - Kanex USB-C Gigabit Ethernet Adapter MacBook 12 """,29.99,0,1298,
9469,ZAG0024-A,Open - Zagg Folio Case with Keyboard Cover iPad Air 2 Black,Reconditioned Case with Bluetooth Keyboard for iPad Air 2 Spanish.,79.99,0,1298,


In [43]:
# Inspect type 11935397 before deciding on a category (none is assigned here)
category_type_df.loc[category_type_df["type"] == "11935397"]

Unnamed: 0,sku,name,desc,price,in_stock,type,category
63,LAC0141,LaCie Porsche Design 1TB External Hard Drive Mac and PC,External Hard Drive Mac and PC USB 3.0 1TB,79.99,0,11935397,
149,OWC0020,Envoy OWC USB 3.0 Case for MacBook Air SSD 2010/2011,Box Portable External USB 2.0 / 3.0 for MacBook Air SSD.,60.99,1,11935397,
174,OWC0009,OWC Case External SuperSlim for SuperDrive MacBook / MacBook Pro,Black External Case for Apple SuperDrive.,48.99,1,11935397,
175,OWC0039,"OWC Mercury On-The-Go Pro Mini transparent box 25 ""FW800 / FW400 / USB3",outer case 25 inch SATA connection FW800 / FW400 / USB 3.0.,96.99,1,11935397,
176,OWC0040,"OWC Mercury On-The-Go Pro Mini transparent box 25 ""USB3 / USB2",outer case 25 inch SATA USB 3.0 and 2.0.,54.99,1,11935397,
...,...,...,...,...,...,...,...
9818,OWC0280,OWC Envoy Pro EX 250GB SSD M.2 PCIe Thunderbolt 3,outer box with SSD M.2 inch 250GB for Mac and PC,338.99,0,11935397,
9819,OWC0281,OWC Envoy Pro EX M.2 PCIe SSD 1TB Thunderbolt 3,outer box with SSD M.2 inch 1TB for Mac and PC,725.99,0,11935397,
9854,WDT0333-A,Open - Western Digital My Passport Ultra 4TB USB 3.0 Black,External hard disk capacity 4TB reconditioned and USB 3.0 connection for Mac and PC,189.99,0,11935397,
9862,TRA0025-A,Open - Transcend StoreJet SSD 500 512GB External Hard Disk Thunderbolt / USB 3.0,External Hard Drive 512GB SSD refitted with Thunderbolt and USB 3.0 for Mac and PC.,520.00,0,11935397,


In [44]:
category_type_df.loc[category_type_df["type"] == "11905404", 'category'] += 'music, '
category_type_df.loc[category_type_df["type"] == "11905404", 'category'].count()

454

In [45]:
category_type_df.loc[category_type_df["type"] == "1282", 'category'] += 'apple computers, '
category_type_df.loc[category_type_df["type"] == "1282", 'category'].count()

373

In [46]:
category_type_df.loc[category_type_df["type"] == "12635403", 'category'] += 'ipad cases, '
category_type_df.loc[category_type_df["type"] == "12635403", 'category'].count()

362

In [47]:
category_type_df.loc[category_type_df["type"] == "13835403", 'category'] += 'macbook cases, '
category_type_df.loc[category_type_df["type"] == "13835403", 'category'].count()

269

In [48]:
category_type_df.loc[category_type_df["type"] == "5,74E+15", 'category'] += 'desktop computers, '
category_type_df.loc[category_type_df["type"] == "5,74E+15", 'category'].count()

247

In [49]:
category_type_df.loc[category_type_df["type"] == "1364", 'category'] += 'memory, '
category_type_df.loc[category_type_df["type"] == "1364", 'category'].count()

216

## I assign 'other' to every row that has no category yet

In [50]:
category_type_df.loc[category_type_df["category"] == ""].count()

sku         6075
name        6075
desc        6075
price       6075
in_stock    6075
type        6029
category    6075
dtype: int64

In [51]:
category_type_df.loc[category_type_df["category"] == "", 'category'] = 'other'

In [52]:
category_type_df.category.value_counts(normalize=True)

category
other                  0.607986
phone cases,           0.105785
servers,               0.093975
music,                 0.045436
apple computers,       0.037330
ipad cases,            0.036229
macbook cases,         0.026922
desktop computers,     0.024720
memory,                0.021617
Name: proportion, dtype: float64

### Clean up the `category` strings

In [53]:
category_type_df.loc[category_type_df.category.str.contains(', $', regex=True), 'category']

4              memory, 
6              memory, 
7              memory, 
8              memory, 
15      macbook cases, 
             ...       
9965    macbook cases, 
9966    macbook cases, 
9967    macbook cases, 
9968    macbook cases, 
9972      phone cases, 
Name: category, Length: 3917, dtype: object

In [54]:
category_type_df.category = category_type_df.category.str.replace(', $', '', regex=True)

In [55]:
category_type_df.category.value_counts(normalize=True)

category
other                0.607986
phone cases          0.105785
servers              0.093975
music                0.045436
apple computers      0.037330
ipad cases           0.036229
macbook cases        0.026922
desktop computers    0.024720
memory               0.021617
Name: proportion, dtype: float64