### Supervised learning algorithms:

Supervised learning model:

- Use the FY2015 ICD-9 diagnosis code long descriptions as training data.
- Use the "Other Diagnosis" column from Excel as test data.

- Import the training data and separate input from output by slicing it into two arrays: one is the input for training, the other is the target.
- Prepare the output target y. Use sklearn.preprocessing.LabelEncoder to encode the target, which is the ICD-9 dx code stored as a str.
- Prepare the input array X. Perform feature extraction on the ICD-9 dx long descriptions with sklearn.feature_extraction.text.CountVectorizer, filtering common English stop words (ENGLISH_STOP_WORDS).
- Create a corpus (bag-of-words with their usage statistics) from the terms that occur in the descriptions.
- Enhance the input variable X with a weight for every term, using sklearn.feature_extraction.text.TfidfVectorizer.
- Use the class sklearn.svm.SVC, which implements support vector classification, as the estimator and fit it on X/y.

- Prepare X-test data ("Other Diagnosis") the same way as the training datasets, using TfidfVectorizer.
- Predict X_test with the model created above.
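
The steps above can be sketched end to end on a toy table. This is a minimal sketch and not the notebook's actual run: the four rows are taken from the output shown later in this notebook, the free-text input is made up, and the linear kernel with `C=100` is an assumption chosen only to keep the toy example simple.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Toy stand-in for the ICD-9 reference file: (dx code, long description).
train_rows = [
    ("0010", "CHOLERA DUE TO VIBRIO CHOLERAE"),
    ("0019", "CHOLERA, UNSPECIFIED"),
    ("44321", "DISSECTION OF CAROTID ARTERY"),
    ("44322", "DISSECTION OF ILIAC ARTERY"),
]
codes = [code for code, _ in train_rows]
descriptions = [desc for _, desc in train_rows]

# Encode the string codes as integer class labels.
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(codes)

# Turn each description into a TF-IDF weighted bag-of-words vector.
tfidf = TfidfVectorizer(stop_words="english")
X_train = tfidf.fit_transform(descriptions)

# Assumption: linear kernel, picked only for this toy example.
model = SVC(kernel="linear", C=100.0)
model.fit(X_train, y_train)

# Vectorize free text with the already fitted vectorizer (transform, not fit),
# then map the predicted integer label back to its dx code.
free_text = ["Dissection of iliac artery, stable"]  # hypothetical test input
X_new = tfidf.transform(free_text)
predicted_codes = label_encoder.inverse_transform(model.predict(X_new)).tolist()
print(predicted_codes)
```

The test text is passed through `transform` only, so it is projected onto the vocabulary learned from the training descriptions.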





### Import Training data


In [1]:
import pandas as pd
import numpy as np
filename = '~/Downloads/icd9dxref.fy15.txt'
df = pd.read_csv(filename, header=None, delimiter='|', dtype=str, nrows=5000)
df.shape
#print(df)

(5000, 7)
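
The `dtype=str` argument matters for this file because codes such as `0010` start with zeros. A small sketch of the difference, assuming a hypothetical two-column sample in place of the real 7-column file:

```python
import io

import pandas as pd

# Hypothetical two-column, pipe-delimited sample; the real file has 7 columns.
sample = "0010|CHOLERA DUE TO VIBRIO CHOLERAE\n0019|CHOLERA, UNSPECIFIED\n"

# With dtype=str every field stays text, so leading zeros survive.
as_text = pd.read_csv(io.StringIO(sample), header=None, delimiter="|", dtype=str)

# Without it, pandas infers an integer column and the zeros are lost.
as_inferred = pd.read_csv(io.StringIO(sample), header=None, delimiter="|")

print(as_text[0].tolist())
print(as_inferred[0].tolist())
```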

In [None]:
#df1 = df[df[1].str.contains("36")]
df1=df
print(df1)
df1.shape

In [3]:
df1=df1.values

### Array Slicing to split input and output variables
df1 is now a NumPy array, so the next step is array slicing, a feature that often causes problems for beginners to Python and NumPy.
Structures like lists and NumPy arrays can be sliced: a subsequence of the structure can be indexed and retrieved.
This is most useful in machine learning when specifying input variables and output variables, or when splitting training rows from testing rows.
Slicing is specified using the colon operator ':' with a 'from' and a 'to' index before and after the colon respectively. The slice starts at the 'from' index and ends one item before the 'to' index.
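
A short sketch of these slicing rules on a small array. The `row0`..`row2` values in column 0 are placeholders; the codes and descriptions come from the output shown below.

```python
import numpy as np

table = np.array([
    ["row0", "0010", "CHOLERA DUE TO VIBRIO CHOLERAE"],
    ["row1", "0011", "CHOLERA DUE TO VIBRIO CHOLERAE EL TOR"],
    ["row2", "0019", "CHOLERA, UNSPECIFIED"],
])

# All rows, columns 1 and 2: the 'to' index 3 is excluded.
codes_and_text = table[:, 1:3]

# All rows, single column 1: the result is one-dimensional.
codes = table[:, 1]

# Row slicing splits training rows from testing rows.
train_part, test_part = table[:2], table[2:]

print(codes_and_text.shape, codes.shape, train_part.shape, test_part.shape)
```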

In [4]:
# separate array into input and output components
inputx = df1[:,1:3]  # columns 1 and 2: dx code and long description
outputy = df1[:,1]
print(inputx) 
print(outputy)

[['0010' 'CHOLERA DUE TO VIBRIO CHOLERAE']
 ['0011' 'CHOLERA DUE TO VIBRIO CHOLERAE EL TOR']
 ['0019' 'CHOLERA, UNSPECIFIED']
 ...
 ['4431' "THROMBOANGIITIS OBLITERANS [BUERGER'S DISEASE]"]
 ['44321' 'DISSECTION OF CAROTID ARTERY']
 ['44322' 'DISSECTION OF ILIAC ARTERY']]
['0010' '0011' '0019' ... '4431' '44321' '44322']


### Prepare Output variable y
Use sklearn.preprocessing.LabelEncoder to encode the string dx codes as integer class labels.
See also sklearn.preprocessing.OrdinalEncoder.

In [5]:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(outputy)

LabelEncoder()

In [None]:
>>> list(le.classes_)

In [7]:
>>> y = le.transform(outputy)
>>> y

array([   0,    1,    2, ..., 4997, 4998, 4999])

In [8]:
>>> list(le.inverse_transform([4, 2, 1]))

['0021', '0019', '0011']
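
Two properties of LabelEncoder are worth keeping in mind here. It numbers the classes in sorted order, and it rejects labels it did not see during `fit`. A minimal sketch with a few codes from this notebook; `9999` is a made-up unseen label:

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()

# Input order does not matter: classes_ is sorted, duplicates share a label.
y_demo = le_demo.fit_transform(["0019", "0010", "0011", "0010"])
print(list(le_demo.classes_))
print(y_demo)

# A label that was not present during fit raises ValueError.
try:
    le_demo.transform(["9999"])  # hypothetical unseen code
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)
```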

### Prepare Input variable X

In [9]:
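>>> from sklearn.feature_extraction.text import CountVectorizer
>>> # ENGLISH_STOP_WORDS is exposed by the text module; newer scikit-learn releases no longer ship the stop_words module
>>> from sklearn.feature_extraction import text as stop_words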

In [None]:
#>>> print(stop_words.ENGLISH_STOP_WORDS)
#>>> my_stop_words = stop_words.ENGLISH_STOP_WORDS.union(my_words)
#>>> vectorizer = CountVectorizer(analyzer=u'word',max_df=0.95,lowercase=True,stop_words=set(my_stop_words),max_features=15000)


In [10]:
>>> vectorizer = CountVectorizer(stop_words=stop_words.ENGLISH_STOP_WORDS)
>>> vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=frozenset({'fill', 'other', 'thereby', 'ie', 'latter', 'together', 'nine', 'perhaps', 'have', 'both', 'behind', 'eleven', 'between', 'due', 'cry', 'everything', 'eight', 'thru', 'whoever', 'none', 'serious', 'whether', 'ever', 'mine', 'else', 'thence', 'of', 'mill', 'her', 'anyone', 'on',...', 'myself', 'via', 'somewhere', 'itself', 'themselves', 'who', 'nevertheless', 'because', 'which'}),
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
#>>> t,z = data[:,0], data[:,3] 
>>> corpus = inputx[:,1]
>>> X = vectorizer.fit_transform(corpus)
>>> X

<5000x2763 sparse matrix of type '<class 'numpy.int64'>'
	with 21909 stored elements in Compressed Sparse Row format>

In [12]:
>>> vectorizer.get_feature_names()

['10th',
 '11th',
 '12th',
 '1st',
 '5q',
 '9th',
 'abdomen',
 'abdominal',
 'abducens',
 'abnormal',
 'abnormalities',
 'abnormality',
 'abortus',
 'abscess',
 'abuse',
 'academic',
 'acanthamoeba',
 'acariasis',
 'accessory',
 'accidental',
 'accommodation',
 'accommodative',
 'achieved',
 'achromatopsia',
 'acid',
 'acidosis',
 'acoustic',
 'acquired',
 'acromegaly',
 'acting',
 'actinomycotic',
 'active',
 'activity',
 'acuminatum',
 'acute',
 'adaptation',
 'adem',
 'adenopathy',
 'adenoviral',
 'adenovirus',
 'adherent',
 'adhesion',
 'adhesions',
 'adhesive',
 'adiposity',
 'adjustment',
 'adnexa',
 'adolescence',
 'adolescent',
 'adolescents',
 'adrenal',
 'adrenogenital',
 'adult',
 'adults',
 'advanced',
 'aerobacter',
 'aerogenes',
 'affect',
 'affecting',
 'affective',
 'africa',
 'african',
 'agents',
 'aggressive',
 'agitans',
 'agoraphobia',
 'ainhum',
 'air',
 'alastrim',
 'alcohol',
 'alcoholic',
 'aldosteronism',
 'aldrich',
 'alexia',
 'alkalosis',
 'allergic',
 'all

In [13]:
>>> X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

The vocabulary is fixed when the vectorizer is fitted, so words that were not seen in the training corpus are completely ignored in future calls to the transform method:

In [14]:
>>> vectorizer.transform(['Something completely dystrophy.']).toarray()

array([[0, 0, 0, ..., 0, 0, 0]])
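
The same behaviour is easier to see on a vocabulary small enough to print in full. A sketch, assuming a two-line toy corpus taken from the descriptions above:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["CHOLERA DUE TO VIBRIO CHOLERAE", "CHOLERA, UNSPECIFIED"]

count_vec = CountVectorizer(stop_words="english")
count_vec.fit(toy_corpus)

# 'due' and 'to' are English stop words, so they never enter the vocabulary.
vocabulary = sorted(count_vec.vocabulary_)
print(vocabulary)

# Known words are counted in their vocabulary column.
known = count_vec.transform(["Cholera, unspecified"]).toarray()

# Words outside the fitted vocabulary contribute nothing: the row is all zeros.
unknown = count_vec.transform(["Something completely different"]).toarray()
print(known, unknown)
```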

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:

In [15]:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer_t = TfidfVectorizer(stop_words=stop_words.ENGLISH_STOP_WORDS)
>>> X = vectorizer_t.fit_transform(corpus)
>>> X.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [16]:
>>> vectorizer_t.transform(['Something completely dystrophy.']).toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])
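
To illustrate that TfidfVectorizer combines the two steps, the sketch below compares it with CountVectorizer followed by TfidfTransformer on a toy corpus. Default parameters are assumed for both routes.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

toy_corpus = [
    "CHOLERA DUE TO VIBRIO CHOLERAE",
    "CHOLERA, UNSPECIFIED",
    "DISSECTION OF CAROTID ARTERY",
]

# Route 1: raw term counts, then TF-IDF weighting.
counts = CountVectorizer(stop_words="english").fit_transform(toy_corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: both steps in one estimator.
one_step = TfidfVectorizer(stop_words="english").fit_transform(toy_corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.linalg.norm(one_step.toarray(), axis=1)
print(same, row_norms)
```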

### Learning and predicting

Here the task is to predict, given a diagnosis description, which ICD-9 dx code it represents. We are given the long description of each code in the training data, on which we fit an estimator to be able to predict the classes to which unseen descriptions belong.
In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).
An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The estimator's constructor takes as arguments the model's parameters.
For now, we will consider the estimator as a black box:

In [17]:
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)

In [18]:
>>> clf.fit(X, y)

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Now you can predict new values. As a sanity check, first predict the first 100 training rows and compare the result with their encoded labels in y.


In [None]:
X[:10]

In [19]:
#>>> clf.predict(X[:-1])
>>> clf.predict(X[:100])

array([  0,   1,   2,   3,   6,   6,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79, 101,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99])
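
Comparing predictions with labels by eye does not scale, so the check can be reduced to an accuracy figure and a list of mismatched rows. This is a sketch on toy data with the same SVC parameters as above; the names `X_toy`, `y_toy` and `clf_toy` are illustrative, and accuracy measured on the training rows themselves says little about unseen text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

toy_codes = ["0010", "0011", "0019", "44321", "44322"]
toy_descs = [
    "CHOLERA DUE TO VIBRIO CHOLERAE",
    "CHOLERA DUE TO VIBRIO CHOLERAE EL TOR",
    "CHOLERA, UNSPECIFIED",
    "DISSECTION OF CAROTID ARTERY",
    "DISSECTION OF ILIAC ARTERY",
]

y_toy = LabelEncoder().fit_transform(toy_codes)
X_toy = TfidfVectorizer(stop_words="english").fit_transform(toy_descs)

clf_toy = SVC(gamma=0.001, C=100.0).fit(X_toy, y_toy)
pred = clf_toy.predict(X_toy)

# Fraction of training rows mapped back to their own label.
train_accuracy = accuracy_score(y_toy, pred)

# Row indices where the predicted label differs from the true one.
mismatched_rows = np.flatnonzero(pred != y_toy)
print(train_accuracy, mismatched_rows)
```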

In [None]:
#>>> list(le.inverse_transform(clf.predict(X[:-1])))
>>> list(le.inverse_transform(clf.predict(X[:100])))

### Prepare Sample X_test

In [21]:
>>> inputx_test = [
'Aftercare following surgery of the sense organs, NEC',
'Retinal dialysis, Repaired, Stable',
'Retinal Tear with Detachment',
'Aftercare following surgery of the sense organs, NEC',
'Retinal Tear without Detachment, Stable',
'Aftercare following surgery of the sense organs, NEC',
'Retinal Tear without Detachment, Stable',
'Aftercare following surgery of the sense organs, NEC',
'Retinal Tear with Detachment',
'Superficial keratitis, unspecified'
... ]

In [None]:
inputx_test

In [23]:
>>> X_test = vectorizer_t.transform(inputx_test)
>>> X_test.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [24]:
>>> clf.predict(X_test)

array([4881, 3720, 3737, 4881, 3737, 4881, 3737, 4881, 3737, 4073])

In [25]:
>>> list(le.inverse_transform(clf.predict(X_test)))

['4294',
 '36104',
 '3619',
 '4294',
 '3619',
 '4294',
 '3619',
 '4294',
 '3619',
 '37020']
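
An SVC returns some class for every row, including rows whose text shares no word with the training vocabulary. Such rows come out of the vectorizer as all zeros, so they can be flagged before trusting the predicted code. A minimal sketch, assuming a toy training vocabulary; the second test string is made up for contrast:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_descs = [
    "CHOLERA, UNSPECIFIED",
    "DISSECTION OF CAROTID ARTERY",
    "DISSECTION OF ILIAC ARTERY",
]
test_descs = [
    "Retinal Tear with Detachment",   # no word in the toy vocabulary
    "Dissection of carotid artery",   # hypothetical input with known words
]

tfidf_demo = TfidfVectorizer(stop_words="english").fit(train_descs)
X_demo = tfidf_demo.transform(test_descs)

# TF-IDF weights are non-negative, so a zero row sum means no known term.
row_weight = np.asarray(X_demo.sum(axis=1)).ravel()
has_known_terms = row_weight > 0
print(has_known_terms)
```

Rows flagged `False` carry no information for the classifier, so their predicted code should not be taken at face value.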