Lambda School Data Science

*Unit 3, Med Cabinet Build*

---

In [119]:
import pandas as pd


pd.set_option('display.max_rows', 500)
df = pd.read_csv('cannabis.csv')
df = df.dropna()

df.head()

Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description
0,100-Og,hybrid,4.0,"Creative,Energetic,Tingly,Euphoric,Relaxed","Earthy,Sweet,Citrus",$100 OG is a 50/50 hybrid strain that packs a ...
1,98-White-Widow,hybrid,4.7,"Relaxed,Aroused,Creative,Happy,Energetic","Floral,Violet,Diesel",The ‘98 Aloha White Widow is an especially pot...
2,1024,sativa,4.4,"Uplifted,Happy,Relaxed,Energetic,Creative","Spicy,Herbal,Sage,Wood",1024 is a sativa-dominant hybrid bred in Spain...
3,13-Dawgs,hybrid,4.2,"Tingly,Creative,Hungry,Relaxed,Uplifted","Apricot,Citrus,Grapefruit",13 Dawgs is a hybrid of G13 and Chemdawg genet...
4,24K-Gold,hybrid,4.6,"Happy,Relaxed,Euphoric,Uplifted,Talkative","Citrus,Earthy,Orange","Also known as Kosher Tangie, 24k Gold is a 60%..."


In [120]:
# df.loc[df['SEARCH_HERE'].isnull() == True]

df.isnull().sum()

Strain         0
Type           0
Rating         0
Effects        0
Flavor         0
Description    0
dtype: int64

## Step 1 - Effects/Flavor Search Preparation

We need to break out the effects and flavors for each strain into something searchable. To do this, we first count the unique entries so we know how many new columns the encoding will create.
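
As a minimal sketch of that counting step, the snippet below splits comma-separated strings and counts the distinct labels. The `toy` frame and its three rows are invented for illustration and are not taken from `cannabis.csv`. It uses `Series.explode` (available in pandas 0.25 and later), which is an alternative to the `apply(pd.Series).stack()` pattern used in the cells of this notebook, not the notebook's own method.

```python
import pandas as pd

# Toy data, invented for illustration (not rows from cannabis.csv).
toy = pd.DataFrame({"Effects": ["Happy,Relaxed", "Relaxed,Sleepy", "Happy"]})

# Split each comma-separated string into a list,
# then give every label its own row.
labels = toy["Effects"].str.split(",").explode()

label_counts = labels.value_counts()  # occurrences of each label
n_unique = labels.nunique()           # number of distinct labels

print(n_unique)
print(label_counts)
```

The number of distinct labels is the number of indicator columns the encoding step would produce for that field.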

In [121]:
# This is what one row of the Effects column looks like

print(df['Effects'][0])

# Python considers this a string

print(type(df['Effects'][0]))

Creative,Energetic,Tingly,Euphoric,Relaxed
<class 'str'>


In [122]:
# pandas can turn each string in a Series into a list with the split method.
# Because Series methods operate on the whole Series, we go through the .str accessor
# so that split is applied to each string; a Series itself has no split method.

df['Effects_List'] = df['Effects'].str.split(',')
df['Flavor_List']  = df['Flavor'].str.split(',')

df['Effects_List']

0       [Creative, Energetic, Tingly, Euphoric, Relaxed]
1         [Relaxed, Aroused, Creative, Happy, Energetic]
2        [Uplifted, Happy, Relaxed, Energetic, Creative]
3          [Tingly, Creative, Hungry, Relaxed, Uplifted]
4        [Happy, Relaxed, Euphoric, Uplifted, Talkative]
                              ...                       
2346     [Happy, Uplifted, Relaxed, Euphoric, Energetic]
2347        [Relaxed, Happy, Euphoric, Uplifted, Sleepy]
2348       [Relaxed, Sleepy, Talkative, Euphoric, Happy]
2349          [Relaxed, Sleepy, Euphoric, Happy, Hungry]
2350          [Hungry, Relaxed, Uplifted, Happy, Sleepy]
Name: Effects_List, Length: 2318, dtype: object

In [123]:
# Now Python sees each entry as a list.

print(type(df['Effects_List'][0]))
print(df['Effects_List'][0])

<class 'list'>
['Creative', 'Energetic', 'Tingly', 'Euphoric', 'Relaxed']


In [124]:
# From here we can see that while some are below 5, none surpass it.
# We might need to do something about that, but for now we can ignore it.

df['Effects_List'].str.len()

0       5
1       5
2       5
3       5
4       5
       ..
2346    5
2347    5
2348    5
2349    5
2350    5
Name: Effects_List, Length: 2318, dtype: int64

In [125]:
# We have 16 unique effect values and 54 unique flavor values
# (both counts include the 'None' placeholder). Now we need to encode them.

print(len(df['Effects_List'].apply(pd.Series).stack().value_counts())
     ,df['Effects_List'].apply(pd.Series).stack().value_counts()
     ,len(df['Flavor_List'].apply(pd.Series).stack().value_counts())
     ,df['Flavor_List'].apply(pd.Series).stack().value_counts()
     )

16 Happy        1842
Relaxed      1705
Euphoric     1614
Uplifted     1485
Creative      733
Sleepy        729
Energetic     632
Focused       592
Hungry        472
Talkative     356
Tingly        340
Giggly        287
Aroused       197
None           85
Uplifting       2
Dry             1
dtype: int64 54 Earthy        1099
Sweet         1049
Citrus         523
Pungent        446
Berry          354
Pine           298
Wood           285
Floral         268
Diesel         240
Herbal         228
Spicy          228
Lemon          191
Skunk          169
Tropical       153
Blueberry      153
Grape          127
None           121
Orange          76
Cheese          67
Pepper          59
Lime            52
Strawberry      47
Minty           41
Pineapple       41
Sage            39
Grapefruit      38
Chemical        38
Lavender        37
Fruity          34
Vanilla         34
Mango           33
Honey           32
Ammonia         28
Nutty           25
Coffee          24
Menthol         22
Butter   

In [126]:
df['Flavor_List'].apply(pd.Series).stack().value_counts()

Earthy        1099
Sweet         1049
Citrus         523
Pungent        446
Berry          354
Pine           298
Wood           285
Floral         268
Diesel         240
Herbal         228
Spicy          228
Lemon          191
Skunk          169
Tropical       153
Blueberry      153
Grape          127
None           121
Orange          76
Cheese          67
Pepper          59
Lime            52
Strawberry      47
Minty           41
Pineapple       41
Sage            39
Grapefruit      38
Chemical        38
Lavender        37
Fruity          34
Vanilla         34
Mango           33
Honey           32
Ammonia         28
Nutty           25
Coffee          24
Menthol         22
Butter          19
Mint            18
Tea             18
Apple           16
Rose            16
Apricot          8
Tobacco          8
Violet           7
Chestnut         7
Tar              7
Peach            6
Sour             4
Pear             3
Plum             2
Tangy            1
Candy            1
Melon       

## Step 2 - Encoding

In [127]:
# These two filters are equivalent, but they compare the raw strings and don't work with lists.
# They also only return EXACT MATCHES: rows whose entire Effects value is 'Creative'.

df.loc[df['Effects'] == 'Creative']

df.loc[df['Effects'].isin(['Creative'])]

Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description,Effects_List,Flavor_List
355,Blukashima,hybrid,5.0,Creative,,Using a Chernobyl male plant to pollenate thei...,[Creative],[None]
369,Brain-Candy,hybrid,5.0,Creative,Sweet,Brain Candy by Insanity Strains is a handy hyb...,[Creative],[Sweet]
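
To illustrate the limitation, here is a small sketch contrasting the exact-match filter with a membership test on the split lists. The `toy` frame and its strain names are hypothetical. Testing membership in the list, rather than searching for a substring, is an assumption made here so that a label such as `Uplifted` is not confused with `Uplifting`.

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
toy = pd.DataFrame({
    "Strain": ["A", "B", "C"],
    "Effects": ["Creative", "Happy,Creative", "Relaxed,Sleepy"],
})

# Exact match: only rows whose whole string equals 'Creative'.
exact = toy.loc[toy["Effects"] == "Creative"]

# Membership: rows whose list of effects includes 'Creative'.
effects_list = toy["Effects"].str.split(",")
contains = toy.loc[effects_list.apply(lambda labels: "Creative" in labels)]

print(list(exact["Strain"]))
print(list(contains["Strain"]))
```

This per-row check works but has to be repeated for every search, which is one motivation for encoding each label as its own column.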


### MultiLabelBinarizer
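
Conceptually, multi-label binarization turns each list of labels into a row of 0/1 indicators, one column per distinct label. The pure-Python sketch below shows that idea on invented data. The `binarize` helper is hypothetical and is not scikit-learn's implementation; it only mirrors the shape of the result, with classes sorted alphabetically as in the column order printed in this notebook.

```python
def binarize(rows):
    """Return (classes, matrix) with one 0/1 column per distinct label."""
    # Collect every distinct label and sort for a stable column order.
    classes = sorted({label for row in rows for label in row})
    # For each row, mark 1 where the label is present, else 0.
    matrix = [[1 if label in row else 0 for label in classes] for row in rows]
    return classes, matrix


# Invented example rows, not taken from the dataset.
rows = [["Creative", "Relaxed"], ["Happy"], ["Happy", "Relaxed"]]
classes, matrix = binarize(rows)

print(classes)
for indicator_row in matrix:
    print(indicator_row)
```

With indicator columns in place, finding every strain with a given effect becomes a simple column filter.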

In [128]:
from sklearn.preprocessing import MultiLabelBinarizer


mlb = MultiLabelBinarizer()

print(
pd.DataFrame(mlb.fit_transform(df['Effects_List'])
            ,columns = mlb.classes_
            ,index   = df.index
            )
     )

      Aroused  Creative  Dry  Energetic  Euphoric  Focused  Giggly  Happy  \
0           0         1    0          1         1        0       0      0   
1           1         1    0          1         0        0       0      1   
2           0         1    0          1         0        0       0      1   
3           0         1    0          0         0        0       0      0   
4           0         0    0          0         1        0       0      1   
...       ...       ...  ...        ...       ...      ...     ...    ...   
2346        0         0    0          1         1        0       0      1   
2347        0         0    0          0         1        0       0      1   
2348        0         0    0          0         1        0       0      1   
2349        0         0    0          0         1        0       0      1   
2350        0         0    0          0         0        0       0      1   

      Hungry  None  Relaxed  Sleepy  Talkative  Tingly  Uplifted  Uplifting

In [129]:
df2 = pd.DataFrame(mlb.fit_transform(df['Effects_List'])
                  ,columns = mlb.classes_
                  ,index   = df.index
                  )

df3 = pd.DataFrame(mlb.fit_transform(df['Flavor_List'])
                  ,columns = mlb.classes_
                  ,index   = df.index
                  )
print(df2.head()
     ,df3.head()
     )

   Aroused  Creative  Dry  Energetic  Euphoric  Focused  Giggly  Happy  \
0        0         1    0          1         1        0       0      0   
1        1         1    0          1         0        0       0      1   
2        0         1    0          1         0        0       0      1   
3        0         1    0          0         0        0       0      0   
4        0         0    0          0         1        0       0      1   

   Hungry  None  Relaxed  Sleepy  Talkative  Tingly  Uplifted  Uplifting  
0       0     0        1       0          0       1         0          0  
1       0     0        1       0          0       0         0          0  
2       0     0        1       0          0       0         1          0  
3       1     0        1       0          0       1         1          0  
4       0     0        1       0          1       0         1          0      Ammonia  Apple  Apricot  Berry  Blueberry  Butter  Candy  Cheese  Chemical  \
0        0      0       

## Step 3 - Merge

In [130]:
df2 = df2.merge(df3, left_index = True, right_index = True)
df  = df.merge(df2, left_index = True, right_index = True)

df  = df.drop(['Effects_List', 'Flavor_List', 'None_x', 'None_y'], axis = 1)
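
The `None_x` and `None_y` names dropped above come from the merge itself: when both frames contain a column with the same name, `DataFrame.merge` appends its default suffixes `_x` and `_y`. A minimal sketch with invented frames (the `effects` and `flavors` names and values are hypothetical):

```python
import pandas as pd

# Two toy indicator frames that both contain a 'None' column.
effects = pd.DataFrame({"Happy": [1, 0], "None": [0, 1]})
flavors = pd.DataFrame({"Sweet": [1, 1], "None": [0, 0]})

# Merging on the index keeps both 'None' columns, renamed with suffixes.
merged = effects.merge(flavors, left_index=True, right_index=True)
print(list(merged.columns))

# Drop the two placeholder columns, as the notebook does.
cleaned = merged.drop(columns=["None_x", "None_y"])
print(list(cleaned.columns))
```

Because `None` appears to be the only label shared by the effect and flavor columns in this dataset, it is the only pair that picks up suffixes.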

In [131]:
df.columns

Index(['Strain', 'Type', 'Rating', 'Effects', 'Flavor', 'Description',
       'Aroused', 'Creative', 'Dry', 'Energetic', 'Euphoric', 'Focused',
       'Giggly', 'Happy', 'Hungry', 'Relaxed', 'Sleepy', 'Talkative', 'Tingly',
       'Uplifted', 'Uplifting', 'Ammonia', 'Apple', 'Apricot', 'Berry',
       'Blueberry', 'Butter', 'Candy', 'Cheese', 'Chemical', 'Chestnut',
       'Citrus', 'Coffee', 'Diesel', 'Earthy', 'Floral', 'Fruity', 'Grape',
       'Grapefruit', 'Herbal', 'Honey', 'Lavender', 'Lemon', 'Lime', 'Mango',
       'Melon', 'Menthol', 'Mint', 'Minty', 'Nutty', 'Orange', 'Peach', 'Pear',
       'Pepper', 'Pine', 'Pineapple', 'Plum', 'Pungent', 'Rose', 'Sage',
       'Skunk', 'Sour', 'Spicy', 'Strawberry', 'Sweet', 'Tangy', 'Tar', 'Tart',
       'Tea', 'Tobacco', 'Tropical', 'Vanilla', 'Violet', 'Wood'],
      dtype='object')