Lambda School Data Science

*Unit 3, Med Cabinet Build*

---

In [95]:
import pandas as pd


pd.set_option('display.max_rows', 500)
df = pd.read_csv('cannabis.csv')
df = df.dropna()

df.head()

| | Strain | Type | Rating | Effects | Flavor | Description |
|---|---|---|---|---|---|---|
| 0 | 100-Og | hybrid | 4.0 | Creative,Energetic,Tingly,Euphoric,Relaxed | Earthy,Sweet,Citrus | $100 OG is a 50/50 hybrid strain that packs a ... |
| 1 | 98-White-Widow | hybrid | 4.7 | Relaxed,Aroused,Creative,Happy,Energetic | Flowery,Violet,Diesel | The ‘98 Aloha White Widow is an especially pot... |
| 2 | 1024 | sativa | 4.4 | Uplifted,Happy,Relaxed,Energetic,Creative | Spicy,Herbal,Sage,Woody | 1024 is a sativa-dominant hybrid bred in Spain... |
| 3 | 13-Dawgs | hybrid | 4.2 | Tingly,Creative,Hungry,Relaxed,Uplifted | Apricot,Citrus,Grapefruit | 13 Dawgs is a hybrid of G13 and Chemdawg genet... |
| 4 | 24K-Gold | hybrid | 4.6 | Happy,Relaxed,Euphoric,Uplifted,Talkative | Citrus,Earthy,Orange | Also known as Kosher Tangie, 24k Gold is a 60%... |


In [96]:
# df.loc[df['SEARCH_HERE'].isnull() == True]

df.isnull().sum()

Strain         0
Type           0
Rating         0
Effects        0
Flavor         0
Description    0
dtype: int64

## Step 1 - Effects/Flavor Search Preparation

We need to break out the effects for each strain into something searchable. To do this, we first count the unique entries so we know how many new columns the encoding will add.
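As a rough sketch of that counting step, the snippet below splits a comma-separated column and counts the distinct labels with `explode`. The `effects` Series is a made-up stand-in for the real column, not data from `cannabis.csv`.

```python
import pandas as pd

# Toy stand-in for the 'Effects' column (illustrative values only).
effects = pd.Series([
    'Creative,Energetic,Relaxed',
    'Relaxed,Happy',
    'Happy',
])

# Split each string into a list, then explode to one label per row.
labels = effects.str.split(',').explode()

# Each distinct label becomes one new column once we encode.
unique_labels = sorted(labels.unique())
n_new_columns = len(unique_labels)

print(unique_labels)
print(n_new_columns)
```

The notebook cells below reach the same count a different way, using `apply(pd.Series).stack()`.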

In [97]:
# This is what one row looks like for effects

df['Effects'][0]

'Creative,Energetic,Tingly,Euphoric,Relaxed'

In [98]:
# Python considers this a string

type(df['Effects'][0])

str

In [99]:
# pandas can turn the strings in a Series into lists with the split method.
# A Series itself has no split method, so we go through the .str accessor,
# which applies split to each string in the Series.

df['Effects_List'] = df['Effects'].str.split(',')
df['Flavor_List']  = df['Flavor'].str.split(',')

df['Effects_List']

0       [Creative, Energetic, Tingly, Euphoric, Relaxed]
1         [Relaxed, Aroused, Creative, Happy, Energetic]
2        [Uplifted, Happy, Relaxed, Energetic, Creative]
3          [Tingly, Creative, Hungry, Relaxed, Uplifted]
4        [Happy, Relaxed, Euphoric, Uplifted, Talkative]
                              ...                       
2346     [Happy, Uplifted, Relaxed, Euphoric, Energetic]
2347        [Relaxed, Happy, Euphoric, Uplifted, Sleepy]
2348       [Relaxed, Sleepy, Talkative, Euphoric, Happy]
2349          [Relaxed, Sleepy, Euphoric, Happy, Hungry]
2350          [Hungry, Relaxed, Uplifted, Happy, Sleepy]
Name: Effects_List, Length: 2318, dtype: object

In [100]:
df['Flavor_List']

0              [Earthy, Sweet, Citrus]
1            [Flowery, Violet, Diesel]
2         [Spicy, Herbal, Sage, Woody]
3        [Apricot, Citrus, Grapefruit]
4             [Citrus, Earthy, Orange]
                     ...              
2346             [Earthy, Woody, Pine]
2347             [Sweet, Berry, Grape]
2348    [Earthy, Sweet, Spicy, Herbal]
2349          [Sweet, Earthy, Pungent]
2350          [Berry, Earthy, Pungent]
Name: Flavor_List, Length: 2318, dtype: object
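To make the `.str` accessor behaviour concrete, here is a small sketch on made-up data. It shows that `split` runs once per element, and that `expand=True` (a standard pandas option the notebook does not use) spreads the pieces across columns instead of building lists.

```python
import pandas as pd

# Toy flavor strings (illustrative values only).
flavors = pd.Series(['Earthy,Sweet,Citrus', 'Sweet'])

# .str.split is applied element by element, giving one list per row.
as_lists = flavors.str.split(',')

# expand=True returns a DataFrame with one column per piece instead.
as_columns = flavors.str.split(',', expand=True)

print(as_lists.tolist())
print(as_columns.shape)
```

Lists keep each row's labels together, which is the shape the encoder in Step 2 expects.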

In [101]:
# Now Python sees the field as a list.

type(df['Effects_List'][0])

list

In [102]:
df['Effects_List'][0]

['Creative', 'Energetic', 'Tingly', 'Euphoric', 'Relaxed']

In [103]:
# No strain lists more than 5 effects, though some list fewer
# (the truncated output below only shows rows with exactly 5).
# We might need to do something about that, but for now we can ignore it.

df['Effects_List'].str.len()

0       5
1       5
2       5
3       5
4       5
       ..
2346    5
2347    5
2348    5
2349    5
2350    5
Name: Effects_List, Length: 2318, dtype: int64
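The truncated output only shows rows with five effects, so a quick way to see the shorter ones is to filter on the lengths. This is a sketch on made-up lists; `effects_list` stands in for `df['Effects_List']`.

```python
import pandas as pd

# Toy stand-in for df['Effects_List'] (illustrative values only).
effects_list = pd.Series([
    ['Creative', 'Energetic', 'Tingly', 'Euphoric', 'Relaxed'],
    ['Creative'],
    ['None'],
])

# Number of labels in each row.
lengths = effects_list.str.len()

# How many rows have each length, and which rows fall short of 5.
print(lengths.value_counts().sort_index())
short_rows = effects_list[lengths < 5]
print(short_rows)
```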

In [104]:
# The output below shows 16 unique effects and 66 unique flavors. Now we need to encode them.

effect_counts = df['Effects_List'].apply(pd.Series).stack().value_counts()
flavor_counts = df['Flavor_List'].apply(pd.Series).stack().value_counts()

print(len(effect_counts)
     ,effect_counts
     ,len(flavor_counts)
     ,flavor_counts
     )

16 Happy        1842
Relaxed      1705
Euphoric     1614
Uplifted     1485
Creative      733
Sleepy        729
Energetic     632
Focused       592
Hungry        472
Talkative     356
Tingly        340
Giggly        287
Aroused       197
None           85
Uplifting       2
Dry             1
dtype: int64 66 Earthy         1099
Sweet          1048
Citrus          523
Pungent         441
Berry           354
Pine            298
Flowery         265
Woody           254
Diesel          238
Spicy           228
Herbal          226
Lemon           191
Skunk           169
Tropical        153
Blueberry       144
Grape           126
None            121
Orange           76
Cheese           67
Pepper           59
Lime             52
Strawberry       47
Pineapple        41
Minty            41
Sage             39
Grapefruit       38
Chemical         38
Lavender         37
Vanilla          34
Mango            33
Honey            32
Tree             31
Fruit            31
Ammonia          28
Nutty        

## Step 2 - Encoding

In [105]:
# These two filters are equivalent. Both compare against the raw string column,
# so they don't work with the list columns.
# They also only return EXACT MATCHES (rows whose Effects is just 'Creative').

df.loc[df['Effects'] == 'Creative']

df.loc[df['Effects'].isin(['Creative'])]

| | Strain | Type | Rating | Effects | Flavor | Description | Effects_List | Flavor_List |
|---|---|---|---|---|---|---|---|---|
| 355 | Blukashima | hybrid | 5.0 | Creative | None | Using a Chernobyl male plant to pollenate thei... | [Creative] | [None] |
| 369 | Brain-Candy | hybrid | 5.0 | Creative | Sweet | Brain Candy by Insanity Strains is a handy hyb... | [Creative] | [Sweet] |
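For contrast with the exact-match filters, one way to find every row that contains a label is to test membership in the list column. This is only a sketch: `df_demo` and its values are invented for illustration.

```python
import pandas as pd

# Toy frame shaped like the notebook's df (illustrative values only).
df_demo = pd.DataFrame({
    'Strain':  ['A', 'B', 'C'],
    'Effects': ['Creative', 'Relaxed,Creative,Happy', 'Sleepy,Relaxed'],
})
df_demo['Effects_List'] = df_demo['Effects'].str.split(',')

# Exact match: only rows whose whole string is 'Creative'.
exact = df_demo.loc[df_demo['Effects'] == 'Creative']

# Membership test: any row whose list includes 'Creative'.
has_creative = df_demo['Effects_List'].apply(lambda labels: 'Creative' in labels)
partial = df_demo.loc[has_creative]

print(exact['Strain'].tolist())
print(partial['Strain'].tolist())
```

Testing membership in the list avoids the substring problem a plain `str.contains` search would have, for example `Blue` also matching `Blueberry` among the flavors. It does run a Python function per row, which is part of why encoding the labels into columns is attractive.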


### MultiLabelBinarizer

In [106]:
from sklearn.preprocessing import MultiLabelBinarizer


mlb = MultiLabelBinarizer()

print(
pd.DataFrame(mlb.fit_transform(df['Effects_List'])
            ,columns = mlb.classes_
            ,index   = df.index
            )
     )

      Aroused  Creative  Dry  Energetic  Euphoric  Focused  Giggly  Happy  \
0           0         1    0          1         1        0       0      0   
1           1         1    0          1         0        0       0      1   
2           0         1    0          1         0        0       0      1   
3           0         1    0          0         0        0       0      0   
4           0         0    0          0         1        0       0      1   
...       ...       ...  ...        ...       ...      ...     ...    ...   
2346        0         0    0          1         1        0       0      1   
2347        0         0    0          0         1        0       0      1   
2348        0         0    0          0         1        0       0      1   
2349        0         0    0          0         1        0       0      1   
2350        0         0    0          0         0        0       0      1   

      Hungry  None  Relaxed  Sleepy  Talkative  Tingly  Uplifted  Uplifting
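The same call on a tiny made-up Series makes the result easier to read. The sketch below assumes scikit-learn is installed. `effects_list` and `mlb_demo` are illustrative names, and the index deliberately has a gap, the way `df.index` does after `dropna()`.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy list column with a gap in the index (illustrative values only).
effects_list = pd.Series([
    ['Creative', 'Relaxed'],
    ['Happy'],
    ['Relaxed', 'Happy'],
], index=[10, 11, 13])

mlb_demo = MultiLabelBinarizer()

# fit_transform learns the sorted label set and returns a 0/1 matrix;
# passing the original index keeps rows aligned for a later merge.
encoded = pd.DataFrame(
    mlb_demo.fit_transform(effects_list),
    columns=mlb_demo.classes_,
    index=effects_list.index,
)

print(encoded)
```

Without `index=`, the new frame would be numbered 0, 1, 2 and would no longer line up with the source rows.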

In [112]:
df2 = pd.DataFrame(mlb.fit_transform(df['Effects_List'])
                  ,columns = mlb.classes_
                  ,index   = df.index
                  )

df3 = pd.DataFrame(mlb.fit_transform(df['Flavor_List'])
                  ,columns = mlb.classes_
                  ,index   = df.index
                  )
print(df2.head()
     ,df3.head()
     )

   Aroused  Creative  Dry  Energetic  Euphoric  Focused  Giggly  Happy  \
0        0         1    0          1         1        0       0      0   
1        1         1    0          1         0        0       0      1   
2        0         1    0          1         0        0       0      1   
3        0         1    0          0         0        0       0      0   
4        0         0    0          0         1        0       0      1   

   Hungry  None  Relaxed  Sleepy  Talkative  Tingly  Uplifted  Uplifting  
0       0     0        1       0          0       1         0          0  
1       0     0        1       0          0       0         0          0  
2       0     0        1       0          0       0         1          0  
3       1     0        1       0          0       1         1          0  
4       0     0        1       0          1       0         1          0      Acrid  Ammonia  Apple  Apricot  Berry  Blue  Blueberry  Butter  Cabernet  \
0      0        0      0  
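As an aside, pandas can build the same kind of indicator matrix straight from the comma-separated strings with `Series.str.get_dummies`. This is an alternative technique, not what the notebook uses, shown here on made-up data.

```python
import pandas as pd

# Toy flavor strings (illustrative values only).
flavor = pd.Series(['Earthy,Sweet', 'Sweet', 'Citrus,Earthy'])

# One 0/1 column per distinct label, split on the comma separator.
encoded = flavor.str.get_dummies(sep=',')

print(encoded)
```

This skips the intermediate list column. `MultiLabelBinarizer` is still the better fit if the fitted encoder needs to be reused on new input later.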

## Step 3 - Merge

In [110]:
df = df.merge(df2, left_index = True, right_index = True)
df = df.merge(df3, left_index = True, right_index = True)
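One thing to watch with these two merges: `None` appears as a label in both the effects and the flavor counts above, so both encoded frames carry a `None` column, and `merge` renames overlapping columns with `_x` and `_y` suffixes. The sketch below reproduces that on toy frames and shows one possible way around it. The `effect_` and `flavor_` prefixes are hypothetical names, not something the notebook defines.

```python
import pandas as pd

# Toy frames that share a 'None' column (illustrative values only).
base    = pd.DataFrame({'Strain': ['A', 'B']}, index=[0, 2])
effects = pd.DataFrame({'Happy': [1, 0], 'None': [0, 1]}, index=[0, 2])
flavors = pd.DataFrame({'Sweet': [1, 0], 'None': [0, 1]}, index=[0, 2])

# Same pattern as the notebook: two index-on-index merges.
merged = base.merge(effects, left_index=True, right_index=True)
merged = merged.merge(flavors, left_index=True, right_index=True)
print(list(merged.columns))  # the shared name comes back as None_x / None_y

# Assumption: prefixing each block keeps the column names unambiguous.
prefixed = pd.concat(
    [base, effects.add_prefix('effect_'), flavors.add_prefix('flavor_')],
    axis=1,
)
print(list(prefixed.columns))
```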

In [111]:
# drop returns a new DataFrame, so assign the result to keep the change.
df = df.drop(['Effects_List', 'Flavor_List'], axis = 1)

df

| | Strain | Type | Rating | Effects | Flavor | Description | Aroused | Creative | Dry | Energetic | ... | Tangy | Tar | Tea | Terpene | Tobacco | Tree | Tropical | Vanilla | Violet | Woody |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100-Og | hybrid | 4.0 | Creative,Energetic,Tingly,Euphoric,Relaxed | Earthy,Sweet,Citrus | $100 OG is a 50/50 hybrid strain that packs a ... | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 98-White-Widow | hybrid | 4.7 | Relaxed,Aroused,Creative,Happy,Energetic | Flowery,Violet,Diesel | The ‘98 Aloha White Widow is an especially pot... | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1024 | sativa | 4.4 | Uplifted,Happy,Relaxed,Energetic,Creative | Spicy,Herbal,Sage,Woody | 1024 is a sativa-dominant hybrid bred in Spain... | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 13-Dawgs | hybrid | 4.2 | Tingly,Creative,Hungry,Relaxed,Uplifted | Apricot,Citrus,Grapefruit | 13 Dawgs is a hybrid of G13 and Chemdawg genet... | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 24K-Gold | hybrid | 4.6 | Happy,Relaxed,Euphoric,Uplifted,Talkative | Citrus,Earthy,Orange | Also known as Kosher Tangie, 24k Gold is a 60%... | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2346 | Zeus-Og | hybrid | 4.7 | Happy,Uplifted,Relaxed,Euphoric,Energetic | Earthy,Woody,Pine | Zeus OG is a hybrid cross between Pineapple OG... | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2347 | Zkittlez | indica | 4.6 | Relaxed,Happy,Euphoric,Uplifted,Sleepy | Sweet,Berry,Grape | Zkittlez is an indica-dominant mix of Grape Ap... | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2348 | Zombie-Kush | indica | 5.0 | Relaxed,Sleepy,Talkative,Euphoric,Happy | Earthy,Sweet,Spicy,Herbal | Zombie Kush by Ripper Seeds comes from two dif... | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2349 | Zombie-Og | indica | 4.4 | Relaxed,Sleepy,Euphoric,Happy,Hungry | Sweet,Earthy,Pungent | If you’re looking to transform into a flesh-ea... | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
