# Market Positioning:

**Question**: Which product will have the highest ROI?

**Context**: The Baby-Boomers (people born just after WWII) are now retiring; they have money, free time, and open minds.  
You and some friends from business school want to meet demand with supply by offering discreet access to premium content.  
Which product should you offer?  
Consider:
  * The cost of procurement
  * Risks (financial & legal)
  * Going market rates
  * Target market preferences
  * ...
  
**Audience**: A shady venture capitalist.  
Not strong with numbers, but she knows the market and can see right through any BS. So get to the point.  

**Data**: https://github.com/fivethirtyeight/data/tree/master/drug-use-by-age


**project source**: https://github.com/elewa-academy/data-science/blob/master/infinite-practice.md

In [1]:
# Download the dataset "drug-use-by-age.csv" from the GitHub repo

import requests                         # fetch raw web data over HTTP
import numpy as np                      # numerical arrays and NaN handling
import pandas as pd                     # tabular data (DataFrame) handling
url_of_dataset_1 = "https://github.com/fivethirtyeight/data/blob/master/drug-use-by-age/drug-use-by-age.csv"
x = requests.get(url_of_dataset_1)
#print (x.text)                         # the response is the rendered HTML page, not raw CSV, hence pd.read_html
drug_use_by_age = pd.read_html(x.text)  # returns a list of DataFrames, one per HTML table found on the page
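
As an alternative sketch, the data can be parsed as CSV instead of being scraped from the rendered GitHub page. The snippet below uses a three-row inline sample (values copied from the table in this notebook) so that it runs offline. `sample_csv` and `sample` are illustrative names. Passing the repository's raw CSV URL to `pd.read_csv` in place of the inline text is an assumption that has not been run here.

```python
import io
import pandas as pd

# Inline sample: three rows and four columns taken from the table in this notebook.
sample_csv = """age,n,cocaine-use,cocaine-frequency
35-49,7391,1.5,15.0
50-64,3923,0.9,36.0
65+,2448,0.0,-
"""

# na_values="-" turns the dash placeholder into NaN, so the
# frequency column is parsed as float instead of as strings.
sample = pd.read_csv(io.StringIO(sample_csv), na_values="-")
print(sample.dtypes)
```

Parsing the dash as a missing value at load time would avoid the string-typed columns that need fixing later in this notebook.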

In [2]:
drug_use_by_age[0].T   # transpose so every column becomes a row, which is easier to read for a wide table

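| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | | | | | | | | | | | | | | | | | |
| age | 12 | 13 | 14 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 | 20.0 | 21.0 | 22-23 | 24-25 | 26-29 | 30-34 | 35-49 | 50-64 | 65+ |
| n | 2798 | 2757 | 2792 | 2956.0 | 3058.0 | 3038.0 | 2469.0 | 2223.0 | 2271.0 | 2354.0 | 4707 | 4591 | 2628 | 2864 | 7391 | 3923 | 2448 |
| alcohol-use | 3.9 | 8.5 | 18.1 | 29.2 | 40.1 | 49.3 | 58.7 | 64.6 | 69.7 | 83.2 | 84.2 | 83.1 | 80.7 | 77.5 | 75 | 67.2 | 49.3 |
| alcohol-frequency | 3 | 6 | 5 | 6.0 | 10.0 | 13.0 | 24.0 | 36.0 | 48.0 | 52.0 | 52 | 52 | 52 | 52 | 52 | 52 | 52 |
| marijuana-use | 1.1 | 3.4 | 8.7 | 14.5 | 22.5 | 28.0 | 33.7 | 33.4 | 34.0 | 33.0 | 28.4 | 24.9 | 20.8 | 16.4 | 10.4 | 7.3 | 1.2 |
| marijuana-frequency | 4 | 15 | 24 | 25.0 | 30.0 | 36.0 | 52.0 | 60.0 | 60.0 | 52.0 | 52 | 60 | 52 | 72 | 48 | 52 | 36 |
| cocaine-use | 0.1 | 0.1 | 0.1 | 0.5 | 1.0 | 2.0 | 3.2 | 4.1 | 4.9 | 4.8 | 4.5 | 4 | 3.2 | 2.1 | 1.5 | 0.9 | 0 |
| cocaine-frequency | 5.0 | 1.0 | 5.5 | 4.0 | 7.0 | 5.0 | 5.0 | 5.5 | 8.0 | 5.0 | 5.0 | 6.0 | 5.0 | 8.0 | 15.0 | 36.0 | - |
| crack-use | 0 | 0 | 0 | 0.1 | 0.0 | 0.1 | 0.4 | 0.5 | 0.6 | 0.5 | 0.5 | 0.5 | 0.4 | 0.5 | 0.5 | 0.4 | 0 |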


## 1. Who do we want to sell to?  
The Baby-Boomers (people born just after WWII) in America.  
We take this to mean Americans born between 1945 and 1955.  
Today (2018) they are between 63 and 73 years old.  
We need to know what "today" means in the dataset:  
*source*: https://www.icpsr.umich.edu/icpsrweb/content/SAMHDA/index.html  
*report*: https://fivethirtyeight.com/features/how-baby-boomers-get-high/  

**Survey data: 2012.**  
**This means they were between 57 and 67 years old in this dataset!**
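
The age arithmetic above can be checked with a few lines of Python. This is a minimal sketch: `birth_years` and `age_range` are illustrative names, and the 1945 to 1955 window is the one assumed in this section.

```python
# Birth-year window for the target group, as stated in this section.
birth_years = (1945, 1955)

def age_range(year, births=birth_years):
    """Return (youngest, oldest) age of the cohort in a given year."""
    oldest = year - births[0]
    youngest = year - births[1]
    return youngest, oldest

for year in (2012, 2018):
    youngest, oldest = age_range(year)
    print(f"{year}: {youngest} to {oldest} years old")
```

In the survey year the cohort spans 57 to 67, so it straddles the `50-64` and `65+` rows of the dataset. This notebook continues with the `50-64` row only.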

In [3]:
drug_use_by_age = drug_use_by_age[0]                        # the dataframe is the first element in a list, let's clean this up
drug_use_by_age.drop(['Unnamed: 0'], axis=1, inplace=True)  # remove the "Unnamed: 0" column

In [4]:
i = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16]       # row labels to remove; we keep only row 15, the 50-64 age group
drug_use_by_age.drop(i, axis=0, inplace=True)     # axis=0 drops rows (by index label) rather than columns
drug_use_by_age.T

| | 15 |
|---|---|
| age | 50-64 |
| n | 3923 |
| alcohol-use | 67.2 |
| alcohol-frequency | 52 |
| marijuana-use | 7.3 |
| marijuana-frequency | 52 |
| cocaine-use | 0.9 |
| cocaine-frequency | 36.0 |
| crack-use | 0.4 |
| crack-frequency | 62.0 |
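
Instead of listing every row label to drop, the wanted row can also be selected by its age label. The sketch below works on a small stand-in table, `demo`, with values taken from the rows shown in this notebook. `demo` and `target` are illustrative names, not part of the original code.

```python
import pandas as pd

# Tiny stand-in for the full table (values taken from this notebook).
demo = pd.DataFrame({
    "age": ["35-49", "50-64", "65+"],
    "n": [7391, 3923, 2448],
    "alcohol-use": [75.0, 67.2, 49.3],
})

# Keep only the rows whose age label matches; `demo` itself is left unchanged.
target = demo[demo["age"] == "50-64"]
print(target.T)
```

Selecting by label keeps working if the row order changes, whereas a hard-coded list of index numbers does not.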


## 2. Verify the Data Veracity and Value

This is a WIDE dataset (one row per age group, one column per measurement): more complex to read, but less prone to errors in the data.

  * Check that the value types are numerical (float)
  * Check the meaning of the variables
  * Check the meaning and integrity of the values (look for corruption)
  * How was the data collected? (biases or imprecision during collection)

In [5]:
drug_use_by_age.info(verbose=True)    # check the value types
# 1. all values are non-null, which is good
# 2. float64(20), int64(1), object(7): we need to check and fix the 7 object columns
# 3. the age column is an object (string); we can remove it
# 4. <substance>-use is the percentage of the age group that used the substance in the past 12 months;
#    the values are already percentages (67.2 means 67.2 %); a later cell multiplies them by 100 and rounds (67.2 becomes 6720)
# 5. <substance>-frequency is the median number of times a user took the substance in the past 12 months;
#    a later cell rounds these to whole numbers (for example 13.5 becomes 14)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 15 to 15
Data columns (total 28 columns):
age                        1 non-null object
n                          1 non-null int64
alcohol-use                1 non-null float64
alcohol-frequency          1 non-null float64
marijuana-use              1 non-null float64
marijuana-frequency        1 non-null float64
cocaine-use                1 non-null float64
cocaine-frequency          1 non-null object
crack-use                  1 non-null float64
crack-frequency            1 non-null object
heroin-use                 1 non-null float64
heroin-frequency           1 non-null object
hallucinogen-use           1 non-null float64
hallucinogen-frequency     1 non-null float64
inhalant-use               1 non-null float64
inhalant-frequency         1 non-null object
pain-releiver-use          1 non-null float64
pain-releiver-frequency    1 non-null float64
oxycontin-use              1 non-null float64
oxycontin-frequency        1 n

In [6]:
# 1. ok
# 2. check the 7 object columns
for i in drug_use_by_age.loc[15]:
    print ("{} of the type: {}".format(i, type(i)))

# and indeed the object columns hold strings!

50-64 of the type: <class 'str'>
3923 of the type: <class 'numpy.int64'>
67.2 of the type: <class 'numpy.float64'>
52.0 of the type: <class 'numpy.float64'>
7.3 of the type: <class 'numpy.float64'>
52.0 of the type: <class 'numpy.float64'>
0.9 of the type: <class 'numpy.float64'>
36.0 of the type: <class 'str'>
0.4 of the type: <class 'numpy.float64'>
62.0 of the type: <class 'str'>
0.1 of the type: <class 'numpy.float64'>
41.0 of the type: <class 'str'>
0.3 of the type: <class 'numpy.float64'>
44.0 of the type: <class 'numpy.float64'>
0.2 of the type: <class 'numpy.float64'>
13.5 of the type: <class 'str'>
2.5 of the type: <class 'numpy.float64'>
12.0 of the type: <class 'numpy.float64'>
0.4 of the type: <class 'numpy.float64'>
5.0 of the type: <class 'str'>
1.4 of the type: <class 'numpy.float64'>
10.0 of the type: <class 'numpy.float64'>
0.3 of the type: <class 'numpy.float64'>
24.0 of the type: <class 'numpy.float64'>
0.2 of the type: <class 'numpy.float64'>
30.0 of the type: <clas
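
One way to turn string-typed columns like these into numbers is `pd.to_numeric` with `errors="coerce"`. The sketch below is not the approach this notebook takes later; it runs on a small stand-in row, and the column `example-frequency` is hypothetical, added only to show what happens to a dash placeholder.

```python
import pandas as pd

# Stand-in row: two frequency values arrive as strings, one as a dash.
row = pd.DataFrame({
    "cocaine-frequency": ["36.0"],
    "crack-frequency": ["62.0"],
    "example-frequency": ["-"],   # hypothetical column with a missing value
})

# errors="coerce" converts parseable strings to floats and everything else to NaN.
converted = row.apply(pd.to_numeric, errors="coerce")
print(converted.dtypes)
```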

In [7]:
# 3. let's drop the age column
drug_use_by_age.drop(["age"], axis=1, inplace=True)

In [9]:
# 4. collect the column names, so we can pick out the <substance>-use columns
x = drug_use_by_age.columns
x

Index(['n', 'alcohol-use', 'alcohol-frequency', 'marijuana-use',
       'marijuana-frequency', 'cocaine-use', 'cocaine-frequency', 'crack-use',
       'crack-frequency', 'heroin-use', 'heroin-frequency', 'hallucinogen-use',
       'hallucinogen-frequency', 'inhalant-use', 'inhalant-frequency',
       'pain-releiver-use', 'pain-releiver-frequency', 'oxycontin-use',
       'oxycontin-frequency', 'tranquilizer-use', 'tranquilizer-frequency',
       'stimulant-use', 'stimulant-frequency', 'meth-use', 'meth-frequency',
       'sedative-use', 'sedative-frequency'],
      dtype='object')

In [11]:
import re
# load the regular expression (regex) tools
# https://regex101.com/ is handy for testing a pattern
regex_use = r"[\w-]+-use"                               # match every name that ends in "-use"
substance_use = re.findall(regex_use, " ".join(x))      # join the column names into one string, then find every "...-use" name
substance_use

['alcohol-use',
 'marijuana-use',
 'cocaine-use',
 'crack-use',
 'heroin-use',
 'hallucinogen-use',
 'inhalant-use',
 'pain-releiver-use',
 'oxycontin-use',
 'tranquilizer-use',
 'stimulant-use',
 'meth-use',
 'sedative-use']
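
The same selection can be sketched without a regular expression, using plain string tests on each column name. The short `columns` index below is a stand-in holding a few of the names shown above; `use_cols` and `freq_cols` are illustrative names.

```python
import pandas as pd

# A few of the column names from the Index shown above.
columns = pd.Index(["n", "alcohol-use", "alcohol-frequency",
                    "pain-releiver-use", "pain-releiver-frequency"])

# Plain string tests avoid joining the names into one long string first.
use_cols = [c for c in columns if c.endswith("-use")]
freq_cols = [c for c in columns if c.endswith("-frequency")]
print(use_cols)
print(freq_cols)
```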

In [12]:
drug_use_by_age[substance_use].T    # show all data about substance "use"

| | 15 |
|---|---|
| alcohol-use | 67.2 |
| marijuana-use | 7.3 |
| cocaine-use | 0.9 |
| crack-use | 0.4 |
| heroin-use | 0.1 |
| hallucinogen-use | 0.3 |
| inhalant-use | 0.2 |
| pain-releiver-use | 2.5 |
| oxycontin-use | 0.4 |
| tranquilizer-use | 1.4 |


In [14]:
[str(i) for i in range(2)]

['0', '1']

In [15]:
#for i in substance_use:
#    print ( int(drug_use_by_age[i].loc[15] * 100) )
#    drug_use_by_age[i].loc[15] = int(drug_use_by_age[i].loc[15] * 100)  #<-- chained assignment; pandas warns about this

drug_use_by_age[substance_use] = [ round(float(drug_use_by_age[i].loc[15]*100), 0) for i in substance_use]
# same effect as the commented-out loop, in a single assignment
# round(value, 0) rounds to a whole number but keeps the float type;
# the inputs are already percentages, so 67.2 (%) is stored as 6720.0

In [16]:
# 5. fix the <substance>-frequency into floats instead of str
x = drug_use_by_age.columns
x

Index(['n', 'alcohol-use', 'alcohol-frequency', 'marijuana-use',
       'marijuana-frequency', 'cocaine-use', 'cocaine-frequency', 'crack-use',
       'crack-frequency', 'heroin-use', 'heroin-frequency', 'hallucinogen-use',
       'hallucinogen-frequency', 'inhalant-use', 'inhalant-frequency',
       'pain-releiver-use', 'pain-releiver-frequency', 'oxycontin-use',
       'oxycontin-frequency', 'tranquilizer-use', 'tranquilizer-frequency',
       'stimulant-use', 'stimulant-frequency', 'meth-use', 'meth-frequency',
       'sedative-use', 'sedative-frequency'],
      dtype='object')

In [17]:
import re
# load the regular expression (regex) tools
# https://regex101.com/ is handy for testing a pattern
regex_frequency = r"[\w-]+-frequency"                               # match every name that ends in "-frequency"
substance_frequency = re.findall(regex_frequency, " ".join(x))      # join the column names into one string, then find every "...-frequency" name
substance_frequency

['alcohol-frequency',
 'marijuana-frequency',
 'cocaine-frequency',
 'crack-frequency',
 'heroin-frequency',
 'hallucinogen-frequency',
 'inhalant-frequency',
 'pain-releiver-frequency',
 'oxycontin-frequency',
 'tranquilizer-frequency',
 'stimulant-frequency',
 'meth-frequency',
 'sedative-frequency']

In [18]:
drug_use_by_age[substance_frequency].T    # show all data about substance "frequency"

| | 15 |
|---|---|
| alcohol-frequency | 52.0 |
| marijuana-frequency | 52.0 |
| cocaine-frequency | 36.0 |
| crack-frequency | 62.0 |
| heroin-frequency | 41.0 |
| hallucinogen-frequency | 44.0 |
| inhalant-frequency | 13.5 |
| pain-releiver-frequency | 12.0 |
| oxycontin-frequency | 5.0 |
| tranquilizer-frequency | 10.0 |


In [19]:
drug_use_by_age[substance_frequency] = [ round(np.float64(drug_use_by_age[i].loc[15]), 0) for i in substance_frequency]
# np.float64(...) converts the strings to floats; round(value, 0) then rounds them to whole numbers (still floats)

In [20]:
###############################################
drug_use_by_age.info(verbose=True)    #check the value types
#ok, so now the data is cleaned

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 15 to 15
Data columns (total 27 columns):
n                          1 non-null int64
alcohol-use                1 non-null float64
alcohol-frequency          1 non-null float64
marijuana-use              1 non-null float64
marijuana-frequency        1 non-null float64
cocaine-use                1 non-null float64
cocaine-frequency          1 non-null int64
crack-use                  1 non-null float64
crack-frequency            1 non-null int64
heroin-use                 1 non-null float64
heroin-frequency           1 non-null int64
hallucinogen-use           1 non-null float64
hallucinogen-frequency     1 non-null float64
inhalant-use               1 non-null float64
inhalant-frequency         1 non-null int64
pain-releiver-use          1 non-null float64
pain-releiver-frequency    1 non-null float64
oxycontin-use              1 non-null float64
oxycontin-frequency        1 non-null int64
tranquilizer-use           1 non-nu

In [21]:
###############################################
# 2. check the object types again for data "type"
for i in drug_use_by_age.loc[15]:
    print ("{} of the type: {}".format(i, type(i)))
#ok, they are all float numbers

3923.0 of the type: <class 'float'>
6720.0 of the type: <class 'float'>
52.0 of the type: <class 'float'>
730.0 of the type: <class 'float'>
52.0 of the type: <class 'float'>
90.0 of the type: <class 'float'>
36.0 of the type: <class 'float'>
40.0 of the type: <class 'float'>
62.0 of the type: <class 'float'>
10.0 of the type: <class 'float'>
41.0 of the type: <class 'float'>
30.0 of the type: <class 'float'>
44.0 of the type: <class 'float'>
20.0 of the type: <class 'float'>
14.0 of the type: <class 'float'>
250.0 of the type: <class 'float'>
12.0 of the type: <class 'float'>
40.0 of the type: <class 'float'>
5.0 of the type: <class 'float'>
140.0 of the type: <class 'float'>
10.0 of the type: <class 'float'>
30.0 of the type: <class 'float'>
24.0 of the type: <class 'float'>
20.0 of the type: <class 'float'>
30.0 of the type: <class 'float'>
20.0 of the type: <class 'float'>
104.0 of the type: <class 'float'>


### Checking the meaning of the variables

**source**: https://github.com/fivethirtyeight/data/tree/master/drug-use-by-age

substance-**use**: percentage of people in the age group who used the substance in the past 12 months (%). After the scaling step above, the stored values are 100 times the percentage (6720 means 67.2 %).  
substance-**frequency**: median number of times a user in the age group used the substance in the past 12 months.
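
To make the `-use` definition concrete, the sketch below converts a percentage into an approximate head count. It assumes `n` is the number of survey respondents in the age group, and it uses the original percentages (67.2, 7.3, 0.9) rather than the scaled values. `use_pct` and `respondents` are illustrative names.

```python
n = 3923  # respondents in the 50-64 group (assumed meaning of `n`)
use_pct = {"alcohol": 67.2, "marijuana": 7.3, "cocaine": 0.9}

# percentage -> approximate number of respondents who reported use
respondents = {name: round(n * pct / 100) for name, pct in use_pct.items()}
print(respondents)
```

Under that assumption, roughly 2636 of the 3923 respondents reported alcohol use, about 286 reported marijuana use, and about 35 reported cocaine use.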

In [22]:
drug_use_by_age.T

| | 15 |
|---|---|
| n | 3923.0 |
| alcohol-use | 6720.0 |
| alcohol-frequency | 52.0 |
| marijuana-use | 730.0 |
| marijuana-frequency | 52.0 |
| cocaine-use | 90.0 |
| cocaine-frequency | 36.0 |
| crack-use | 40.0 |
| crack-frequency | 62.0 |
| heroin-use | 10.0 |


### Tidy up the DataFrame structure
  1. columns: "substance", "use", "frequency", "legal", "market_price"
  2. index: a plain running number (reset_index)
  3. drop n: the number of survey respondents in the age group
  4. drop the leftover index label 15

In [29]:
# build the list of substance names: split each column name at "-" and keep the first part
# (note that "pain-releiver-frequency" is shortened to "pain" this way)
column_names = [x.split("-")[0] for x in drug_use_by_age[substance_frequency].columns]
column_names

['alcohol',
 'marijuana',
 'cocaine',
 'crack',
 'heroin',
 'hallucinogen',
 'inhalant',
 'pain',
 'oxycontin',
 'tranquilizer',
 'stimulant',
 'meth',
 'sedative']
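
Splitting at the first hyphen shortens `pain-releiver` to `pain`, as the list above shows. A possible alternative, sketched here on three of the column names, is to split only at the last hyphen with `rsplit`. `first_part` and `full_name` are illustrative names.

```python
columns = ["alcohol-frequency", "pain-releiver-frequency", "meth-frequency"]

# split("-")[0] keeps only the text before the FIRST hyphen...
first_part = [c.split("-")[0] for c in columns]
# ...while rsplit("-", 1)[0] removes only the final "-frequency" suffix.
full_name = [c.rsplit("-", 1)[0] for c in columns]

print(first_part)
print(full_name)
```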

In [30]:
# 1. create a new DataFrame with a more compact layout of 5 columns
# 2. this generates a fresh, clean index, so no reset is needed
# 3. & 4. the new DataFrame no longer carries index label 15 or "n" (the number of respondents)
new_df = pd.DataFrame({'substance': column_names,
                       'use': drug_use_by_age[substance_use].values[0],
                       'frequency': drug_use_by_age[substance_frequency].values[0]})
new_df['legal'] = np.nan
new_df['market_price'] = np.nan

In [31]:
new_df       #show the new dataframe table

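| | substance | use | frequency | legal | market_price |
|---|---|---|---|---|---|
| 0 | alcohol | 6720.0 | 52.0 | | |
| 1 | marijuana | 730.0 | 52.0 | | |
| 2 | cocaine | 90.0 | 36.0 | | |
| 3 | crack | 40.0 | 62.0 | | |
| 4 | heroin | 10.0 | 41.0 | | |
| 5 | hallucinogen | 30.0 | 44.0 | | |
| 6 | inhalant | 20.0 | 14.0 | | |
| 7 | pain | 250.0 | 12.0 | | |
| 8 | oxycontin | 40.0 | 5.0 | | |
| 9 | tranquilizer | 140.0 | 10.0 | | |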


Estimated black-market prices in the US (2018):  

  * Meth: 3-500 dollars per gram (https://www.havocscope.com/black-market-prices/meth-prices/)
  * Marijuana: 20-1800 dollars per ounce (https://www.havocscope.com/black-market-prices/marijuana-prices/)
  * Ecstasy: 35.5 dollars per pill (https://www.havocscope.com/black-market-prices/ecstasy-prices/)
  * Cocaine: 30-300 dollars per gram (https://www.havocscope.com/black-market-prices/cocaine-prices/)
  * Heroin: 110-200 dollars per gram (https://www.havocscope.com/black-market-prices/heroin-prices/)

Conversions:

  * 1 kilo = 1000 grams
  * 1 pound = 453.592 grams
  * 1 ounce = 28.3495 grams, so 1/8th of an ounce (an "eighth") is about 3.5 grams
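
Because the quoted prices mix grams and ounces, they need a common unit before they can be compared. The sketch below converts the per-ounce marijuana range to a per-gram range using the conversion factor given above. `GRAMS_PER_OUNCE` and `per_gram` are illustrative names.

```python
GRAMS_PER_OUNCE = 28.3495  # conversion factor quoted above

def per_gram(price_per_ounce):
    """Convert a price per ounce into a price per gram."""
    return price_per_ounce / GRAMS_PER_OUNCE

# Marijuana range quoted above: 20 to 1800 dollars per ounce.
low, high = per_gram(20), per_gram(1800)
print(f"{low:.2f} to {high:.2f} dollars per gram")
print(f"an eighth of an ounce is {GRAMS_PER_OUNCE / 8:.2f} grams")
```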

## 3. Exploratory Data Analysis

Visualize and explore possible correlations and usefulness. More data is needed: the `legal` and `market_price` columns are still empty.