In [1]:
import pandas as pd
import numpy as np

# Data exploration

A description of the dataset's variables can be found [here](https://www.icpsr.umich.edu/web/ICPSR/studies/34933/variables).

In [2]:
df = pd.read_csv('./data/DS0001/34933-0001-Data.tsv', sep = '\t')
print('Shape of the dataset: ',df.shape)
df.head()

Shape of the dataset:  (55268, 3120)


Unnamed: 0,CASEID,QUESTID2,CIGEVER,CIGOFRSM,CIGWILYR,CIGTRY,CIGYFU,CIGMFU,CIGREC,CIG30USE,...,IIEMPSTY,II2EMSTY,EMPSTAT4,IIEMPST4,II2EMST4,PDEN00,COUTYP2,ANALWT_C,VESTR,VEREP
0,1,50886467,2,4,4,991,9991,91,91,91,...,1,1,99,9,9,2,2,1275.597449,30054,2
1,2,13766883,2,99,99,991,9991,91,91,91,...,1,1,1,1,1,2,2,5191.071173,30031,1
2,3,17772877,2,99,99,991,9991,91,91,91,...,1,1,1,1,1,3,3,419.742011,30056,2
3,4,45622817,1,99,99,13,9999,99,2,93,...,1,1,2,1,1,2,2,1449.303889,30054,1
4,5,17239390,1,99,99,11,9999,99,4,93,...,1,1,1,1,1,1,1,15344.293577,30012,2


## Checking for NAs

In [3]:
naCounts = df.isna().sum()
naCounts[naCounts > 0]

Series([], dtype: int64)

## Create response variables

### Description

The survey examined the consumption of 12 drugs, and many different variables were recorded for each drug. To design the response variables, I followed the paper by *Arie Cohen and Martha D. Harrison (The “Urge to Classify” the Drug User: A Review of Classifications by Pattern of Abuse, 1986)*, which collects the ways drug users have been classified, among others by their pattern of abuse. They concluded that "Degree of involvement classifications can, in turn, be subdivided into: (1) those directly measuring degree of involvement, (2) those measuring degree of involvement through indexing, and (3) those measuring degree of involvement through mode of administration." Given the nature of the underlying dataset and the sources considered in that paper (e.g. Cockett (1971), National Commission on Marijuana and Drug Abuse Report (1973)), approach (1) was used: respondents were grouped into four user groups plus a non-user group.

After a closer examination of the dataset, each respondent was classified, based on the answers to the questions listed below, as one of the following:

* _non user (Category: 5),_
* _experimental user (Category: 4),_
* _circumstantial user (Category: 3),_
* _intensified user (Category: 2),_
* _compulsive user (Category: 1)_.

The answers to the following questions were used to classify the respondent:
* Have you ever consumed *something*? (*XY*EVER)
    * 1 - Yes $\Rightarrow$ Category 1-4
    * 2 - No $\Rightarrow$ Category 5
    
    
* Time since last consumed a *something* (*XY*REC)?
    * 1, 11 - Within the past 30 days $\Rightarrow$ Category 1, 2
    * 2, 8 - More than 30 days ago but within the past 12 months $\Rightarrow$ Category 3
    * 3 - More than 12 months ago $\Rightarrow$ Category 4
    
    
* Total # of days used *something* in past 12 months (*XY*YRTOT)?
    * \> 250 (~ 70%) $\Rightarrow$ Category 1
    * \> 180 (~ 50%) $\Rightarrow$ Category 2
    * \>= 5 & <= 180 $\Rightarrow$ Category 3
    * \>= 1 & < 5 $\Rightarrow$ Category 4
    * 0 $\Rightarrow$ Category 5
    

* What would be the easiest way for you to tell us how many days you've used it (within the past 12 months)? (*XY*BSTWAY) 
    * 1, 11 - Prefer to answer in days per week. $\Rightarrow$ Category 1
    * 2, 12 - Prefer to answer in days per month. $\Rightarrow$ Category 2
    * 3, 13 - Prefer to answer in days per year. $\Rightarrow$ Category 3
    * 93 - did not use XY in the past 12 months $\Rightarrow$ Category 4
    * 91 - Never used XY $\Rightarrow$ Category 5
    

* On how many days in the past 12 months did you consume *XY* (*XY*DAYPYR)?
    * \>= 250 (~ 5 times per week) $\Rightarrow$ Category 1
    * \>= 52 (~ once per week) $\Rightarrow$ Category 2
    * \>= 5 & < 52 $\Rightarrow$ Category 3
    * \>= 1 & < 5 $\Rightarrow$ Category 4
    * 0 $\Rightarrow$ Category 5
  
  
* On average, how many days did you consume *XY* each week during the past 12 months (*XY*DAYPWK)?
    * \>= 5 $\Rightarrow$ Category 1
    * 2 - 4 $\Rightarrow$ Category 2
    * 1 $\Rightarrow$ Category 3
    * 0 $\Rightarrow$ Category 4, 5
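
As a minimal sketch (not the notebook's own implementation, which uses a range-keyed dictionary), the *XY*DAYPYR thresholds can be written as a sorted list of lower bounds and looked up with `bisect`. The function name `days_to_category`, the two constants, and the choice to return `None` for values above 366 are assumptions made for illustration; the bands assumed here are 0, 1-4, 5-51, 52-249, and 250 or more days.

```python
from bisect import bisect_right

# Lower bound of each band, in ascending order (assumed bands, see text).
LOWER_BOUNDS = [0, 1, 5, 52, 250]
# Category assigned to the band that starts at the matching lower bound.
CATEGORIES = [5, 4, 3, 2, 1]


def days_to_category(days):
    """Map days of use in the past 12 months to a category from 1 to 5.

    Values outside 0..366 (for example survey codes such as 991) are
    returned as None; this handling is a hypothetical choice.
    """
    if not 0 <= days <= 366:
        return None
    # bisect_right counts how many lower bounds are <= days.
    return CATEGORIES[bisect_right(LOWER_BOUNDS, days) - 1]


for days in (0, 3, 5, 51, 52, 250, 991):
    print(days, days_to_category(days))
```

A sorted list of bounds keeps every threshold in one place, which makes off-by-one errors at the band edges easier to spot.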


In [4]:
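# Adapted from: https://stackoverflow.com/questions/39358092/range-as-dictionary-key-in-python
# A dict keyed by range objects; looking up a number returns the value of the range that contains it.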
class RangeDict(dict):
    def __getitem__(self, item):
        if not isinstance(item, range):
            for key in self:
                if item in key:
                    return self[key]
            raise KeyError(item)
        else:
            return super().__getitem__(item)
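
To illustrate how lookups in `RangeDict` behave, the sketch below repeats the class so that it runs on its own and queries a few hypothetical bands (`demo` and its labels are made up for illustration and are not the survey's categories). It also shows that the end of a `range` is exclusive, so a range such as `range(991, 991)` is empty and can never match a key.

```python
class RangeDict(dict):
    """Copy of the class above so that this snippet is self-contained."""

    def __getitem__(self, item):
        if isinstance(item, range):
            return super().__getitem__(item)
        for key in self:
            if item in key:
                return self[key]
        raise KeyError(item)


# Hypothetical bands, for illustration only.
demo = RangeDict({range(0, 1): 'none', range(1, 5): 'low', range(5, 10): 'high'})

print(demo[0])  # none
print(demo[4])  # low
print(demo[5])  # high

# The end bound is exclusive: range(991, 991) contains nothing,
# while range(991, 992) contains exactly the value 991.
print(len(range(991, 991)))    # 0
print(991 in range(991, 992))  # True
```

A value that falls in none of the ranges (here, for example, `demo[10]`) raises `KeyError`.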

In [5]:
categoryLogic = {
    'BSTWAY':{
        1: 1,
        11: 1,
        2: 2,
        12: 2,
        3: 3,
        13: 3,
        93: 4,
        91: 5
    },
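    'DAYPYR':RangeDict({
        # range ends are exclusive: range(991, 991) would be empty and never match
        range(991, 992): 5,
        # end at 366 so that 365 days is included
        range(250, 366): 1,
        range(52, 250): 2,
        range(5, 52): 3,
        range(1, 5): 4,
        range(0, 1): 5
    }),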
    'DAYPWK':{
        91:5,
        7:1,
        6:1,
        5:1,
        4:2,
        3:2,
        2:2,
        1:3,
        0:5
    }
}

### Analyses before creating the response variables

#### Finding the drugs to be analyzed

In [6]:
variables = ["EVER", "REC", "RTOT", "BSTWAY", "DAYPYR", "DAYPWK", "30EST"]

In [7]:
# Count the columns whose name ends with each variable suffix
for crVar in variables:
    crLength = df.columns.str.endswith(crVar).sum()
    print('{} variable can be found for {} drugs.'.format(crVar, crLength))

EVER variable can be found for 17 drugs.
REC variable can be found for 38 drugs.
RTOT variable can be found for 13 drugs.
BSTWAY variable can be found for 13 drugs.
DAYPYR variable can be found for 13 drugs.
DAYPWK variable can be found for 13 drugs.
30EST variable can be found for 11 drugs.


In [8]:
df.columns[pd.Series(df.columns).str.endswith("BSTWAY")]

Index(['ALBSTWAY', 'MRBSTWAY', 'CCBSTWAY', 'CRBSTWAY', 'HRBSTWAY', 'HLBSTWAY',
       'INBSTWAY', 'PRBSTWAY', 'OXBSTWAY', 'TRBSTWAY', 'STBSTWAY', 'MTBSTWAY',
       'SVBSTWAY'],
      dtype='object')

In [9]:
print('REC: ', list(map(lambda x: x[:-3], df.columns[pd.Series(df.columns).str.endswith("REC")])))
print('YRTOT: ', list(map(lambda x: x[:-5], df.columns[pd.Series(df.columns).str.endswith("YRTOT")])))
print('BSTWAY :', list(map(lambda x: x[:-6], df.columns[pd.Series(df.columns).str.endswith("BSTWAY")])))
print('DAYPWK: ',list(map(lambda x: x[:-6], df.columns[pd.Series(df.columns).str.endswith("DAYPWK")])))

REC:  ['CIG', 'SNF', 'CHEW', 'SLT', 'CIGAR', 'ALC', 'MJ', 'COC', 'CRAK', 'HER', 'HALL', 'LSD', 'PCP', 'ECS', 'INH', 'ANAL', 'OXYC', 'TRAN', 'STIM', 'METH', 'SED', 'HRSMK', 'HRSNF', 'HRNDL', 'MTNDL', 'MTHA', 'OSTNL', 'CONDL', 'GHB', 'ADDE', 'AMBI', 'COLD', 'KETA', 'TRYP', 'SALV', 'BLNT', 'MMTRD', 'TXLAS']
YRTOT:  ['ALC', 'MJ', 'COC', 'CRK', 'HER', 'HAL', 'INH', 'ANL', 'OXY', 'TRN', 'STM', 'MTH', 'SED']
BSTWAY : ['AL', 'MR', 'CC', 'CR', 'HR', 'HL', 'IN', 'PR', 'OX', 'TR', 'ST', 'MT', 'SV']
DAYPWK:  ['AL', 'MR', 'CC', 'CR', 'HR', 'HL', 'IN', 'PR', 'OX', 'TR', 'ST', 'MT', 'SV']


For simplicity, the YRTOT and REC variables are omitted. Note that they use different drug prefixes (e.g. `ALC` instead of `AL`) than BSTWAY and DAYPWK.

The following 13 drugs are going to be analyzed:
- AL: alcohol
- MR: marijuana
- CC: cocaine
- CR: crack
- HR: heroin
- HL: hallucinogens
- IN: inhalants
- PR: pain relievers
- OX: OxyContin
- TR: tranquilizers
- ST: stimulants
- MT: methamphetamine
- SV: sedatives

In [10]:
drugs = list(map(lambda x: x[:-6], df.columns[pd.Series(df.columns).str.endswith("BSTWAY")]))

### Creating the response variables

...based on the BSTWAY, DAYPWK and DAYPYR variables.

In [11]:
features = ['BSTWAY', 'DAYPWK', 'DAYPYR']

In [12]:
def findValue(myKey, myDict):
    # Return the category mapped to myKey, or None when the key has no mapping
    try:
        return myDict[myKey]
    except (KeyError, TypeError):
        return None

In [13]:
for drug in drugs:
    # Map each raw answer to a category (None when the answer has no mapping)
    tmp = pd.DataFrame()
    for i, feature in enumerate(features):
        tmp[drug + feature +'_cat_' + str(i)] = list(map(lambda x: findValue(x, categoryLogic[feature]), df[drug + feature]))

    # Aggregate once all features are mapped: mean category and spread between features
    df[drug + '_cat'] = np.mean(tmp, axis = 1)
    df[drug + '_catRange'] = np.max(tmp, axis = 1) - np.min(tmp, axis = 1)
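
As a rough illustration of the aggregation step, the sketch below computes a mean category and a category range for three made-up respondents. The column names and values are hypothetical and are not taken from the survey. It assumes that unmapped answers end up as missing values, which pandas skips by default when computing row-wise `mean`, `max` and `min`.

```python
import pandas as pd

# Hypothetical category codes for three respondents; None marks an answer
# that could not be mapped to a category.
tmp = pd.DataFrame({
    'BSTWAY_cat': [1, 3, 5],
    'DAYPWK_cat': [1, None, 5],
    'DAYPYR_cat': [2, 4, 5],
})

summary = pd.DataFrame({
    # Row-wise mean of the mapped categories; missing values are skipped.
    'cat': tmp.mean(axis=1),
    # Spread between the highest and lowest mapped category per respondent.
    'catRange': tmp.max(axis=1) - tmp.min(axis=1),
})
print(summary)
```

For the second respondent only two features are mapped, so the mean is taken over those two values (3.5) rather than over three. A large `catRange` would indicate that the features disagree about how intensive the respondent's use is.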
               

## Data visualization

### Cross-visualization based on drug selections

## Feature selection

Given the huge number of features, feature selection needs to be performed. Besides an advanced analysis of the dataset, my goal is to create an algorithm that evaluates a person's risk of being a drug consumer from manually entered input, so I also need a reasonable number of meaningful variables (i.e. ones that people can fill in by hand). To achieve both goals, I followed two methods:
 * I. semi-manual feature selection for the interpretable risk-evaluation algorithm,
 * II. algorithmic feature selection for the deeper analysis of the dataset.

My choice of feature selection algorithms is based on [this article](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/).

### Method I.

### Method II.

ID           0
Age          0
Gender       0
Education    0
Country      0
Ethnicity    0
Nscore       0
Escore       0
Oscore       0
Ascore       0
Cscore       0
Impulsive    0
SS           0
Alcohol      0
Amphet       0
Amyl         0
Benzos       0
Caff         0
Cannabis     0
Choc         0
Coke         0
Crack        0
Ecstasy      0
Heroin       0
Ketamine     0
Legalh       0
LSD          0
Meth         0
Mushrooms    0
Nicotine     0
Semer        0
VSA          0
dtype: int64